I have rebuilt the FPGA with Juan Francisco's suggested change to the DMA FIFO,
and since then I haven't run into the problem using UHD v3.13.0.1. His patch
was:
OUTPUT2: begin
  // Replicated write logic to break a read timing critical path for read_count
  read_count <= (output_page_boundry < occupied_minus_one) ?
    output_page_boundry[7:0] : occupied_minus_one[7:0];
- read_count_plus_one <= (output_page_boundry < occupied_minus_one) ?
-   ({1'b0, output_page_boundry[7:0]} + 9'd1) : {1'b0, occupied[7:0]};
+ read_count_plus_one <= (output_page_boundry < occupied_minus_one) ?
+   ({1'b0, output_page_boundry[7:0]} + 9'd1) : ({1'b0, occupied_minus_one[7:0]} + 9'd1);
Brian
On Mon, Oct 8, 2018 at 1:28 PM Michael West <[email protected]> wrote:
> Hi Alan,
>
> Try increasing the TX ring buffer size for the network interface and make sure
> the CPU frequency governor is not throttling the CPU (i.e. set it to
> "performance" rather than "ondemand" or "powersave").
>
> The samples per packet for TX was reduced because the larger frame size
> was actually resulting in even more underruns in testing. That does mean
> a larger number of smaller packets, which creates more load on the CPU to
> process the network interrupts; the settings above will help tune the system
> to better handle that load. We are looking at ways to increase the TX frame
> size back to where it was and reduce that load, but that will take
> significant changes to accomplish and those changes probably won't be
> available for a while.
>
> Regards,
> Michael
>
> On Wed, Sep 5, 2018 at 1:22 PM Alan Conrad via USRP-users <
> [email protected]> wrote:
>
>> I tried Brian’s suggestion to rebuild UHD and the FPGA from the commits he
>> listed (thanks Brian). However, with this combination I am getting
>> significantly more underruns than I did previously, even with the
>> benchmark_rate program. Here’s the output of benchmark_rate that I got
>> originally with UHD 4.0.0.rfnoc-devel-788-g1f8463cc.
>>
>>
>>
>> ./benchmark_rate --rx_rate 100e6 --tx_rate 100e6 --channels="0,1"
>>
>>
>>
>> Benchmark rate summary:
>>
>> Num received samples: 2016651900
>>
>> Num dropped samples: 0
>>
>> Num overruns detected: 0
>>
>> Num transmitted samples: 2005972016
>>
>> Num sequence errors (Tx): 0
>>
>> Num sequence errors (Rx): 0
>>
>> Num underruns detected: 562
>>
>> Num late commands: 0
>>
>> Num timeouts (Tx): 0
>>
>> Num timeouts (Rx): 0
>>
>>
>>
>> And now I get this with UHD 3.14.0.HEAD-31-g98057752.
>>
>>
>>
>> Benchmark rate summary:
>>
>> Num received samples: 2001309816
>>
>> Num dropped samples: 0
>>
>> Num overruns detected: 0
>>
>> Num transmitted samples: 1841996424
>>
>> Num sequence errors (Tx): 0
>>
>> Num sequence errors (Rx): 0
>>
>> Num underruns detected: 353655
>>
>> Num late commands: 0
>>
>> Num timeouts (Tx): 0
>>
>> Num timeouts (Rx): 0
>>
>>
>>
>> One difference I did notice between these two versions of UHD is the
>> maximum samples per packet returned from the get_max_num_samps() function
>> for both the Rx and Tx streams. With the version from the rfnoc-devel
>> branch, I get 1996 samples for both the Rx and Tx streams. But, the UHD
>> 3.14.0 version gives 1996 samples for the Rx stream but only 996 samples
>> for the Tx stream. I’m not sure if this is causing the additional
>> underruns.
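>>
>> (For reference, this is roughly how I’m reading those values; usrp is the
>> multi_usrp device object, the stream formats are just what my application
>> happens to use, and the numbers in the comments are what I observe:)
>>
>>     uhd::stream_args_t stream_args("fc32", "sc16"); // CPU / over-the-wire formats
>>     stream_args.channels = {0};                     // one channel per streamer
>>     uhd::rx_streamer::sptr rx_stream = usrp->get_rx_stream(stream_args);
>>     uhd::tx_streamer::sptr tx_stream = usrp->get_tx_stream(stream_args);
>>     std::cout << "RX spp: " << rx_stream->get_max_num_samps() << std::endl; // 1996
>>     std::cout << "TX spp: " << tx_stream->get_max_num_samps() << std::endl; // 996 on 3.14.0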
>>
>>
>>
>> In any case, was a change made to limit the number of transmit samples
>> per packet? Are there additional network configurations that I need to
>> make to increase the maximum samples per packet for the Tx stream, or to
>> limit the underruns with these versions of UHD and the FPGA firmware? BTW,
>> setting “spp” in the transmit stream args does not allow more than 996
>> samples per packet.
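>>
>> (The spp attempt looks something like this; even with it set, the TX
>> streamer still reports 996 from get_max_num_samps():)
>>
>>     uhd::stream_args_t tx_args("fc32", "sc16");
>>     tx_args.args["spp"] = "1996"; // requested TX samples per packet
>>     uhd::tx_streamer::sptr tx_stream = usrp->get_tx_stream(tx_args);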
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Alan
>>
>>
>>
>>
>>
>> *From:* Brian Padalino <[email protected]>
>> *Sent:* Tuesday, August 28, 2018 8:57 PM
>> *To:* Alan Conrad <[email protected]>
>> *Cc:* [email protected]
>> *Subject:* Re: [USRP-users] Transmit Thread Stuck Receiving Tx Flow
>> Control Packets
>>
>>
>>
>>
>>
>> On Tue, Aug 28, 2018 at 4:02 PM Alan Conrad via USRP-users <
>> [email protected]> wrote:
>>
>> Hi All,
>>
>>
>>
>> I’ve been working on an application that requires two receive streams and
>> two transmit streams, written using the C++ API. I have run into a problem
>> when transmitting packets and I am hoping that someone has seen something
>> similar and/or may be able to shed some light on this.
>>
>>
>>
>> My application streams two receive and two transmit channels, each at
>> 100 Msps over dual 10GigE interfaces (the NIC is an Intel X520-DA2). I have
>> two receive threads, each calling recv() on a separate receive stream, and
>> two transmit threads, each calling send() on a separate transmit stream.
>> Each receive thread copies samples into a large circular buffer, and each
>> transmit thread reads samples from that buffer to pass to the send() call.
>> So, each receive thread is paired with a transmit thread through a shared
>> circular buffer, with mutex locking to prevent simultaneous access to the
>> shared buffer memory.
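>>
>> (For clarity, each RX/TX pair is structured roughly like the sketch below.
>> Error handling, metadata handling, rate matching and buffer sizing are all
>> omitted, the std::deque is just a stand-in for my own circular buffer, and
>> rx_stream / tx_stream are the per-channel streamers:)
>>
>>     // Sketch only: assumes <uhd/usrp/multi_usrp.hpp>, <thread>, <mutex>,
>>     // <deque>, <atomic>, <complex> and that the streamers already exist.
>>     std::atomic<bool> running{true};
>>     std::mutex mtx;
>>     std::deque<std::vector<std::complex<float>>> ring; // stand-in circular buffer
>>
>>     std::thread rx_thread([&]() {
>>         uhd::rx_metadata_t md;
>>         std::vector<std::complex<float>> buf(rx_stream->get_max_num_samps());
>>         while (running) {
>>             const size_t n = rx_stream->recv(&buf.front(), buf.size(), md, 1.0);
>>             std::lock_guard<std::mutex> lock(mtx);
>>             ring.emplace_back(buf.begin(), buf.begin() + n);
>>         }
>>     });
>>
>>     std::thread tx_thread([&]() {
>>         uhd::tx_metadata_t md;
>>         while (running) {
>>             std::vector<std::complex<float>> chunk;
>>             {
>>                 std::lock_guard<std::mutex> lock(mtx);
>>                 if (ring.empty()) continue; // lock released, try again
>>                 chunk = std::move(ring.front());
>>                 ring.pop_front();
>>             }
>>             tx_stream->send(&chunk.front(), chunk.size(), md, 1.0);
>>         }
>>     });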
>>
>>
>>
>> I did read in the UHD manual that recv() is not thread-safe. I assumed this
>> meant that recv() is not thread-safe when called on the same rx_streamer
>> from two different threads, but would be OK when called on different
>> rx_streamers. If this is not the case, please let me know.
>>
>>
>>
>> On to my problem…
>>
>>
>>
>> After running for several minutes, one of the transmit threads will get
>> stuck in the send() call. Using strace to monitor the system calls, it
>> appears that the thread is in a loop continuously calling the poll() and
>> recvfrom() system calls from within the UHD API. Here’s the output of
>> strace attached to one of the transmit threads after this has occurred.
>> These are the only two system calls that get logged for the transmit
>> thread once this problem occurs.
>>
>>
>>
>> 11:19:04.564078 poll([{fd=62, events=POLLIN}], 1, 100) = 0 (Timeout)
>>
>> 11:19:04.664276 recvfrom(62, 0x5619724e90c0, 1472, MSG_DONTWAIT, NULL,
>> NULL) = -1 EAGAIN (Resource temporarily unavailable)
>>
>> 11:19:04.664381 poll([{fd=62, events=POLLIN}], 1, 100) = 0 (Timeout)
>>
>> 11:19:04.764600 recvfrom(62, 0x5619724e90c0, 1472, MSG_DONTWAIT, NULL,
>> NULL) = -1 EAGAIN (Resource temporarily unavailable)
>>
>> 11:19:04.764699 poll([{fd=62, events=POLLIN}], 1, 100) = 0 (Timeout)
>>
>> 11:19:04.864906 recvfrom(62, 0x5619724e90c0, 1472, MSG_DONTWAIT, NULL,
>> NULL) = -1 EAGAIN (Resource temporarily unavailable)
>>
>>
>>
>> This partial stack trace shows that the transmit thread is stuck in the
>> while loop in the tx_flow_ctrl() function. I think this is happening due
>> to missed or missing TX flow control packets.
>>
>>
>>
#0  0x00007fdb8fe4fbf9 in __GI___poll (fds=fds@entry=0x7fdb167fb510, nfds=nfds@entry=1, timeout=timeout@entry=100) at ../sysdeps/unix/sysv/linux/poll.c:29
>>
>> #1  0x00007fdb9186de45 in poll (__timeout=100, __nfds=1, __fds=0x7fdb167fb510) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
>>
>> #2  uhd::transport::wait_for_recv_ready (timeout=0.10000000000000001, sock_fd=<optimized out>) at /home/aconrad/rfnoc/src/uhd/host/lib/transport/udp_common.hpp:59
>>
>> #3  udp_zero_copy_asio_mrb::get_new (index=@0x55726266f6e8: 28, timeout=<optimized out>, this=<optimized out>) at /home/aconrad/rfnoc/src/uhd/host/lib/transport/udp_zero_copy.cpp:79
>>
>> #4  udp_zero_copy_asio_impl::get_recv_buff (this=0x55726266f670, timeout=<optimized out>) at /home/aconrad/rfnoc/src/uhd/host/lib/transport/udp_zero_copy.cpp:226
>>
>> #5  0x00007fdb915d48cc in tx_flow_ctrl (fc_cache=..., async_xport=..., endian_conv=0x7fdb915df600 <uhd::ntohx<unsigned int>(unsigned int)>, unpack=0x7fdb918b1090 <uhd::transport::vrt::chdr::if_hdr_unpack_be(unsigned int const*, uhd::transport::vrt::if_packet_info_t&)>) at /home/aconrad/rfnoc/src/uhd/host/lib/usrp/device3/device3_io_impl.cpp:345
>>
>>
>>
>> The poll() and recvfrom() calls are in the
>> udp_zero_copy_asio_mrb::get_new() function in udp_zero_copy.cpp.
>>
>>
>>
>> Has anyone seen this problem before, or have any suggestions on what else
>> to look at to debug it further? I have not yet used Wireshark to see what’s
>> happening on the wire, but I’m planning to do that. Also note that if I run
>> a single transmit/receive pair (instead of two), I don’t see this problem
>> and everything works as I expect.
>>
>>
>>
>> My hardware is an X310 with the XG firmware and dual SBX-120
>> daughterboards. Here are the software versions I’m using, as displayed by
>> the UHD API when the application starts.
>>
>>
>>
>> [00:00:00.000049] Creating the usrp device with:
>> addr=192.168.30.2,second_addr=192.168.40.2...
>>
>> [INFO] [UHD] linux; GNU C++ version 7.3.0; Boost_106501;
>> UHD_4.0.0.rfnoc-devel-788-g1f8463cc
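>>
>> (The device itself is created along these lines:)
>>
>>     uhd::usrp::multi_usrp::sptr usrp = uhd::usrp::multi_usrp::make(
>>         std::string("addr=192.168.30.2,second_addr=192.168.40.2"));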
>>
>>
>>
>> The host is a Dell PowerEdge R420 with 24 CPU cores and 24 GB of RAM. I
>> think the clock speed is a little lower than recommended at 2.7 GHz, but I
>> thought that I could distribute the workload across the various cores to
>> account for that. Also, I have followed the instructions for setting up
>> dual 10 GigE interfaces on the X310 here:
>> https://kb.ettus.com/Using_Dual_10_Gigabit_Ethernet_on_the_USRP_X300/X310
>>
>>
>>
>> Any help is appreciated.
>>
>>
>>
>> I think you're hitting this:
>>
>>
>>
>> https://github.com/EttusResearch/uhd/issues/203
>>
>>
>>
>> Which is the same thing that I hit. I tracked it down to something
>> happening in the FPGA with the DMA FIFO.
>>
>>
>>
>> I rebuilt my FPGA and UHD from the following commits, which switch over to
>> byte-based flow control:
>>
>>
>>
>> UHD commit 98057752006b5c567ed331c5b14e3b8a281b83b9
>>
>> FPGA commit c7015a9a57a77c0e312f0c56e461ac479cf7f1e9
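>>
>> (i.e. roughly the following, assuming separate clones of the EttusResearch
>> uhd and fpga repositories:)
>>
>>     git -C uhd checkout 98057752006b5c567ed331c5b14e3b8a281b83b9
>>     git -C fpga checkout c7015a9a57a77c0e312f0c56e461ac479cf7f1e9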
>>
>>
>>
>> And the problem disappeared for the time being. The infinite loop still
>> exists as a potential issue, but it seemed whatever was causing the lockup
>> in the DMA FIFO disappeared or at least couldn't be reproduced.
>>
>>
>>
>> Give that a shot and see if it works for you, or if you can still
>> reproduce it? We never got to the root cause of the problem.
>>
>>
>>
>> Brian
>
_______________________________________________
USRP-users mailing list
[email protected]
http://lists.ettus.com/mailman/listinfo/usrp-users_lists.ettus.com