Oh wow, that's interesting, though I have little idea how to reconcile
storage backpressure with the networking dropping out.
My best guess is that the storage system pushing back on too much data
somehow causes additional CPU load (especially interrupts and context
switches), and that worsens the situation enough that the network stack
gets confused.
But that's exactly what it is: a wild guess into the blue.
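
If you ever feel like testing that guess, something along these lines
(only a sketch; the interface name is a placeholder) would show whether
interrupt and context-switch rates spike while the storage pushes back:

# "in" (interrupts) and "cs" (context switches) columns, once a second,
# while the streaming application is running
vmstat 1

# per-queue NIC interrupt counters, refreshed every two seconds
# (replace enp3s0 with the interface facing the X300)
watch -n 2 'grep enp3s0 /proc/interrupts'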

Anyway: happy to hear you've got it working!

By the way, what's your kernel's write buffer size? In a pinch, the
following (single line) command should spit out the size in gibibytes
(bash arithmetic is integer-only, so it rounds down):

echo $(( $(sed -n -e 's/^MemTotal:[[:space:]]*\([[:digit:]]*\).*/\1/p' /proc/meminfo) * $(sysctl -n vm.dirty_ratio) / 100 / 2**20 )) GiB

If that is less write buffer than you expected, try bumping up
vm.dirty_ratio (i.e. the percentage of RAM that dirty pages may occupy
before the kernel forces the write cache to be flushed to disk and file
write operations become blocking) to something more generous. If your
system's main purpose is to write stuff to disk, and it will still
function well with the remaining 10% of RAM, try
`sysctl -w vm.dirty_ratio=90` and see whether that "evens out" the
write peaks a bit.
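
If that does help, you can make the setting persistent across reboots
with a sysctl drop-in file; the file name below is just an example, and
the background ratio is merely my guess at a sensible companion value:

# /etc/sysctl.d/90-writeback.conf
# let dirty pages fill up to 90% of RAM before writers block
vm.dirty_ratio = 90
# start background writeback early so flushing stays gradual
vm.dirty_background_ratio = 5

# apply without rebooting:
sysctl --system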

Best regards,
Marcus  

On Tue, 2018-12-04 at 08:19 +0100, Stefan van der Linden wrote:
> It took us a while, but we seem to have found the root cause of the
> issue. The RAID array being fed the data could only just cope with the
> peaks in the data flow, although the mean data rate was not a problem.
> So whenever some additional process fired up, or some other
> imperfection occurred, the buffer in use could overflow. We were able
> to prevent this from happening by adding an SSD-based write cache.
> However, we still don't understand why this effectively caused the
> X300 and/or the NIC to lock up, although I'm glad the problem is gone.
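> 
> In case anyone runs into something similar: watching the array with
> something like the following should make such peaks visible (the
> device name is just a placeholder for whatever backs your RAID):
> 
> # extended per-device statistics once per second; %util close to 100%
> # and rising write latency (await) point at a saturated array
> iostat -x /dev/md0 1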
> 
> Kind regards,
> 
> Stefan
> 
> On 27/09/2018 11:45, Stefan van der Linden wrote:
> > Hi Marcus,
> > 
> > We've gone and updated UHD (HEAD of remotes/origin/UHD-3.13) and
> > changed the MTU to 8000; unfortunately the problem still persists. A
> > TCP dump as discussed before is downloadable via:
> > https://we.tl/t-zTKY2iHAlK.
> > Note that 192.168.50.1 is the host and 192.168.50.2 is the X300. The
> > download also contains a dump of the shell output, just in case. The
> > program ran without problems for a good two hours or so.
> > Hope this helps in debugging!
> > 
> > Kind regards,
> > 
> > Stefan
> > 
> > 
> > On 24/09/2018 22:38, Marcus Müller wrote:
> > > Hi Stefan,
> > > 
> > > so I've talked to our main software-sustaining hero, and we rather
> > > quickly came to the conclusion that it's pretty likely you should
> > > move on to the head of the 3.13 branch (remotes/origin/UHD-3.13).
> > > Are you building from source, or are you using binary packages?
> > > 
> > > Best regards,
> > > Marcus
> > > 
> > > On Mon, 2018-09-24 at 20:04 +0200, Marcus Müller wrote:
> > > > Hi Stefan,
> > > > 
> > > > I know it's not of great comfort when an engineer finds a
> > > > problem to be /interesting/, but yours certainly is.
> > > > So, first things first: if the computational power and memory of
> > > > the host that your USRP is connected to allow it, it might be
> > > > good to keep a packet capture in some kind of ring buffer, so
> > > > that you can infer a bit about the state at the point where
> > > > things go wrong:
> > > > 
> > > > # -n:     no DNS lookups
> > > > # -i:     the network interface facing the X300
> > > > # -s 0:   capture full packets (no snapshot-length truncation)
> > > > # -C 400: roughly 400 million bytes per capture file
> > > > # -W 2:   rotate through two capture files
> > > > # -w:     make sure the output file is on a RAM filesystem;
> > > > #         if in doubt, make one with `mount -t tmpfs tmpfs /path`
> > > > tcpdump -n -i <your network device here> -s 0 -C 400 -W 2 -w /tmp/rotate.pcap
> > > > 
> > > > So, yes, using an MTU of 8000 would be the first thing that the
> > > > Ettus hivemind would recommend, too, but if you say things still
> > > > go wrong, we might need to dig deeper.
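> > > > 
> > > > Just so we're comparing the same thing, on the host side that
> > > > would be something like the following (the interface name is
> > > > just a placeholder):
> > > > 
> > > > # set the MTU on the 10 GbE interface facing the X300
> > > > ip link set dev enp3s0 mtu 8000
> > > > # verify
> > > > ip link show dev enp3s0 | grep -o 'mtu [0-9]*'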
> > > > 
> > > > I do know that we've improved the bus clocking, and that had an
> > > > impact on the X300 firmware. Is trying the last 3.10 release an
> > > > option for you?
> > > > 
> > > > Best regards,
> > > > Marcus
> > > > 
> > > > On Mon, 2018-09-24 at 09:23 +0200, Stefan van der Linden via
> > > > USRP-users wrote:
> > > > > Hi,
> > > > > 
> > > > > We are in the process of prototyping a setup using an X300
> > > > > with two UBX-40 daughterboards, to be used in the validation
> > > > > of an externally provided signal source. The daughterboards
> > > > > are each dedicated to one of two tasks: transmitting a
> > > > > pre-recorded reference signal in a loop at 50 MSps, and
> > > > > capturing that same signal again at 25 MSps after it has
> > > > > passed through a chain of devices under test. This is to run
> > > > > continuously for up to 24 hours.
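> > > > > 
> > > > > (For a sense of the data rates involved, assuming the default
> > > > > sc16 wire format at 4 bytes per complex sample: the 25 MSps
> > > > > capture alone is 25e6 * 4 = 100 MB/s, roughly 8.6 TB over a
> > > > > 24-hour run, and the 50 MSps transmit stream adds another
> > > > > 200 MB/s on the same 10 GbE link.)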
> > > > > 
> > > > > The X300 is connected to the (dedicated) host computer via a
> > > > > 10 Gbps link to an Intel X520-DA2 NIC over SFP+. On the host,
> > > > > we are currently using the kitchen_sink utility included with
> > > > > UHD to run the system in multi-channel mode. We are using UHD
> > > > > 3.11.0.1.
> > > > > 
> > > > > The system works flawlessly when performing short measurements
> > > > > (say, up to an hour or so). However, having recently started
> > > > > setting up the system for long 24-hour tests, we are seeing
> > > > > timeouts from which UHD is unable to recover. These timeouts
> > > > > occur randomly: sometimes after ~1 hour, other times after ~18
> > > > > hours, and everywhere in between. Naturally, this random
> > > > > behaviour makes it difficult to debug.
> > > > > 
> > > > > The error message retrieved from UHD is as follows:
> > > > > 
> > > > > As previous messages on this list have mentioned varying the
> > > > > MTU settings (for example:
> > > > > http://lists.ettus.com/pipermail/usrp-users_lists.ettus.com/2014-November/039561.html
> > > > > ), this was the first thing we tried. Unfortunately, these
> > > > > timeouts occur more often at lower MTU values.
> > > > > 
> > > > > Hopefully someone is able to point us in the right direction.
> > > > > Perhaps we are dealing with hardware issues here, but I do
> > > > > hope we are able to solve this through software.
> > > > > 
> > > > > Thanks,
> > > > > Stefan van der Linden
_______________________________________________
USRP-users mailing list
USRP-users@lists.ettus.com
http://lists.ettus.com/mailman/listinfo/usrp-users_lists.ettus.com
