Re: [USRP-users] B200 USB high CPU usage

2018-02-17 Thread Marcus Müller via USRP-users
Hi Kasper,

SIMD is fine and all for sample processing in the end (for example, to
convert the samples from on-the-wire format to something your CPU can
calculate with), but it's really no "fix it all" for this kind of thing.

A significant part of the workload is handling the fact that your USB
controller has received a packet and (hopefully) DMA'ed it somewhere
without CPU intervention (handling DMA interrupts might include having to
do a context switch), then getting that packet, understanding it,
assigning it to the right USB stack chain, handing it over to libusb,
another context switch, getting that packet's content out of the packet,
checking the stream IDs/sequence numbers, and only then can you actually
deal with the samples with SIMD.

That whole kernel-side USB handling happens for every single USB
transfer. That's a lot of non-vectorizable, hardly parallelizable CPU
work to be done. Larger recv_frame_size values reduce this handling
overhead, because fewer transfers are needed for the same amount of data.
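
As a rough illustration of why the frame size matters, the sketch below
estimates how many USB transfers per second have to be handled at a
given sample rate. It assumes one frame per USB transfer and ignores
packet headers; the helper name transfers_per_second is hypothetical and
not part of UHD.

```python
# Bytes per complex sample for each over-the-wire (OTW) format:
# sc16 = 2 x 16 bit, sc12 = 2 x 12 bit, sc8 = 2 x 8 bit.
BYTES_PER_SAMPLE = {"sc16": 4.0, "sc12": 3.0, "sc8": 2.0}


def transfers_per_second(sample_rate, otw, frame_size):
    """Approximate USB transfers per second (hypothetical helper).

    Assumes one frame per USB transfer and ignores packet headers,
    so the real number will be somewhat higher.
    """
    return sample_rate * BYTES_PER_SAMPLE[otw] / frame_size


for frame_size in (4096, 8192, 16384):
    n = transfers_per_second(56e6, "sc12", frame_size)
    print(f"recv_frame_size={frame_size:5d}: ~{n:,.0f} transfers/s")
```

Under these assumptions, doubling the frame size halves the number of
times the per-transfer handling has to run.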

So, SIMD is heavily employed in our OTW-to-CPU format converters
(which, by the way, also are multithreaded AND double as the step that
moves the data from the USB buffer to the user-supplied buffer array,
which eliminates the need for memcpy), but it's only vectorizing what
can be vectorized.

Best regards,
Marcus

Re: [USRP-users] B200 USB high CPU usage

2018-02-14 Thread Kasper Føns via USRP-users
Hi Nate and Marcus.

I didn't receive Marcus's mail, but could read it on the archive.

I am having problems just receiving the 56e6 samples up from the device.
I am not writing them to disk (see the '--null' argument that avoids
write calls to disk), nor processing them in any way.

I am using UHD 3.10.3.0 (maint branch). Images are fetched using the
supplied populate_images.py script.

I feel that 100 instructions per sample is way too high, since the CPU
has single instruction, multiple data (SIMD) instructions, which seem
to be in use (I saw them in the CPU profiler).

I totally agree that writing these samples to disk is a tall order,
which we are not interested in. The rx_samples_to_file program was used
as a benchmark program, but as Nate suggested, benchmark_rate is
better suited for this.

Meltdown is not patched on the Linux host. Altered governor to
'performance'. No throttling.

Thank you for suggesting the sc12 wire format. We knew that the
precision was only 12 bits; sc12 should of course be used for benchmarking.


Following Marcus's suggestion of altering only num_recv_frames, we have
been able to receive 56 Msps reliably on the i5-3210M Linux host. It
still uses ~70% CPU.

We cannot reach 56 Msps on Windows, not even close. We tried to run
  start /B /wait /realtime ./benchmark_rate.exe --rx_otw sc12 --rx_rate 40e6 --args="num_recv_frames=43"
but it fails with occasional overflows.

It seems the only viable solution is to run on Linux if 56 Msps is required.

Regards,
Kasper

Re: [USRP-users] B200 USB high CPU usage

2018-02-13 Thread Nate Temple via USRP-users
Hi Kasper,

There are several caveats to consider with regard to running at higher
sample rates with the B2xx. Generally speaking, Linux will offer better
performance than Windows.

What version of UHD are you using? If you're not using UHD 3.10.3.0, can
you please try upgrading? UHD 3.10.3.0 includes a commit [1] and updated
firmware, which optimizes the FX3 performance.

The i5-3210M may be slightly underpowered for the task. However, you can
try all of the following performance tuning adjustments, and it may then
be able to support 56 MS/s.

It is worth noting that the recent KPTI patches and other related
workarounds [2] for Intel CPUs, which protect against Meltdown/Spectre
attacks [3], may cause considerable overhead. Here [4] are instructions on
how to check whether KPTI is enabled on Ubuntu. If it is enabled on your
system, you may want to disable KPTI and then test how much additional
overhead it creates when running your application.

Adjust your CPU governor to "performance". This can be done with the
cpufrequtils utility [5]. ( sudo cpufreq-set -g performance )

Ensure your CPU is not throttling due to overheating ( sudo cat
/var/log/syslog | grep throttled ). This is very common in laptops,
especially older devices where the thermal grease is in need of replacement.

You can test using sc8 and sc12 OTW (over the wire) [6] sample sizes with
the benchmark_rate [7] utility. Using sc12 will not drop any information,
as the ADC/DAC on the B2xx is 12 bits.

./benchmark_rate --rx_otw sc12 --rx_rate 40e6
./benchmark_rate --tx_otw sc8 --tx_rate 40e6
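
To see what the OTW format means for the USB link, here is a small
sketch of the payload data rate per format. It counts only the I and Q
sample bits (16, 12 or 8 bits each) and ignores packet headers; the
function name otw_bytes_per_second is made up for this example.

```python
# Bits per I or Q component for each OTW format.
BITS_PER_COMPONENT = {"sc16": 16, "sc12": 12, "sc8": 8}


def otw_bytes_per_second(sample_rate, otw):
    """Payload bytes per second on the wire (headers ignored)."""
    bits_per_sample = 2 * BITS_PER_COMPONENT[otw]  # I and Q
    return sample_rate * bits_per_sample / 8


for otw in ("sc16", "sc12", "sc8"):
    for rate in (40e6, 56e6):
        mb_s = otw_bytes_per_second(rate, otw) / 1e6
        print(f"{otw} at {rate / 1e6:.0f} MS/s: {mb_s:.0f} MB/s")
```

Going from sc16 to sc12 cuts the payload by a quarter, and sc8 halves
it, at the cost of resolution in the sc8 case.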

Some USB controllers can be problematic. Intel Series 7/8/9 USB controllers
usually offer the best performance.

Using Thinkpads (T430s, T470p) I've found that a recv/send frame size of
8192 tends to work the best at higher sample rates.

As Marcus mentioned, the UHD examples are provided as an API reference and
are not tuned for performance. A case in point is rx_samples_to_file being
single-threaded. GNU Radio offers a multi-threaded architecture by default,
which can be useful to test. You may, however, need to adjust the minimum
buffer sizes within the GR blocks to handle the higher sample rates.

I've attached an example of rx_samples_to_file.cpp which is multi-threaded
and has additional buffering.

Without an SSD or NVMe drive, sustaining a high sample rate to disk can be
difficult. Depending upon your system configuration, you may want to
consider using a RAM disk. I would recommend leaving at least 2-8 GB of RAM
for your host OS (this is dependent upon your application etc). This will,
however, limit the length of time you can record (as limited by the RAM in
the machine). Below is an example to create a 24 GB ramdisk:

mkdir -p ~/ramfs
sudo mount -t tmpfs -o size=24G tmpfs ~/ramfs
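
As a rough sketch of how long such a ramdisk lasts, the snippet below
divides the ramdisk size by the write rate. It assumes samples are
stored as complex 16-bit shorts (4 bytes per sample) by default, that
"24G" means 24 GiB, and that nothing else uses the ramdisk; the helper
name ramdisk_seconds is hypothetical.

```python
GIB = 2**30  # assuming tmpfs "size=24G" means 24 GiB


def ramdisk_seconds(ramdisk_bytes, sample_rate, bytes_per_sample=4):
    """Recording time in seconds until the ramdisk is full.

    bytes_per_sample=4 assumes complex 16-bit shorts on disk;
    use 8 for complex 32-bit floats.
    """
    return ramdisk_bytes / (sample_rate * bytes_per_sample)


for label, size in (("short", 4), ("float", 8)):
    t = ramdisk_seconds(24 * GIB, 56e6, size)
    print(f"24 GiB at 56 MS/s, complex {label}: about {t:.0f} s")
```

Under these assumptions, a 24 GiB ramdisk holds roughly two minutes of
56 MS/s data as complex shorts, and about half that as complex floats.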


[1] -
https://github.com/EttusResearch/uhd/commit/d95613152da3e7c7f41c71acca65101ed0896893
[2] - https://en.wikipedia.org/wiki/Kernel_page-table_isolation
[3] - https://en.wikipedia.org/wiki/Meltdown_(security_vulnerability)
[4] -
https://askubuntu.com/questions/992137/how-to-check-that-kpti-is-enabled-on-my-ubuntu
[5] - http://www.thinkwiki.org/wiki/How_to_use_cpufrequtils
[6] - https://files.ettus.com/manual/page_stream.html#stream_datatypes_otw
[7] -
https://github.com/EttusResearch/uhd/blob/maint/host/examples/benchmark_rate.cpp

Regards,
Nate Temple

Re: [USRP-users] B200 USB high CPU usage

2018-02-13 Thread Marcus D. Leech via USRP-users

On 02/13/2018 04:52 AM, Kasper Føns via USRP-users wrote:

Hi.

We have bought a B200 board and are having issues simply receiving the
samples and would like some support in the matter.

Running the command
   ./rx_samples_to_file --null --rate 56e6
on a Sony Vaio Z with an i5-3210M running Ubuntu Server 17.10 shows a
high CPU usage of ~78%.

Is such a high CPU usage expected?
Switching terminal windows (ALT + F1 or ALT + F2) is enough to cause an
overflow on the Linux host.


There is also a high CPU usage on a Windows 10 machine (ThinkPad,
i7-4810MQ).
Running
   ./rx_samples_to_file --null --rate 56e6
results in an infinite stream of overflows.

Running
   ./rx_samples_to_file --null --rate 32e6
utilizes 22% CPU and still overflows once in a while. Moving a
calculator window around the screen results in overflows.

We have tried increasing buffers using
   --args="recv_frame_size=X,num_recv_frames=Y"
However, we haven't been able to increase X to values higher than ~16000
(16384 fails with lots of overflows).
The same applies to Y, where 300 fails with an error.

The software was compiled in release mode and run over a USB 3 connection.

Thus, for USB transfers using the B200:
  - On Vaio: Is ~78% CPU usage expected for 56 Msps?
  - On Win10: Is it not possible to receive 56 Msps?
  - On Win10: Is 22% CPU usage expected for 32 Msps?
  - Is there some limit to recv_frame_size? A value of 16384 fails with
infinite overflows.
  - Is there some way of tuning the framework for lower CPU?

We hope you are able to help!

Let's do some first-order math.

You're bringing in ~5e7 samples/second.
If we optimistically assume a mean instructions-per-sample (including both
kernel and user-space code) of 100 instructions/sample, then we're talking
a requirement for 5e9 instructions/second. If your CPU is running at
3e9 Hz, then it'll need to, on average, issue about 1.7 instructions/cycle.
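
The first-order math above can be written down as a small sketch. The
100 instructions/sample figure and the 3e9 Hz clock are the assumptions
from the text, not measurements, and the function names are invented for
this example.

```python
def required_ipc(sample_rate, instructions_per_sample, clock_hz):
    """Average instructions per cycle the CPU must sustain."""
    return sample_rate * instructions_per_sample / clock_hz


def cycles_per_sample(sample_rate, clock_hz):
    """Clock cycles available per sample on a single core."""
    return clock_hz / sample_rate


# Assumed figures from the estimate: 100 instructions/sample, 3e9 Hz.
print(f"IPC needed at 5e7 S/s:   {required_ipc(5e7, 100, 3e9):.2f}")
print(f"IPC needed at 5.6e7 S/s: {required_ipc(5.6e7, 100, 3e9):.2f}")
print(f"Cycles per sample at 5e7 S/s: {cycles_per_sample(5e7, 3e9):.0f}")
```

Put differently, a single 3e9 Hz core has about 60 cycles to spend on
each sample at 5e7 samples/second.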


You'll generally get more "mileage" out of num_recv_frames than the frame
size. On any given system, my understanding is that this is a shared
resource (across a given USB controller, I think, but don't quote me).

Now, the rx_samples_to_file application is single-threaded, so it's trying
to service the data-stream from the USRP at the same time as it's making
filesystem calls to write the data (even if to /dev/null). That's a tall
order for a single-threaded application running at 5.6e7 sps. These
applications, provided with UHD, are generally intended as *coding
examples*, and no guarantees exist with respect to performance on any
given system. Furthermore, some USB3 controllers are better at handling
bulk high-data-rate applications than others, and the controller landscape
changes so quickly that it's next to impossible to provide up-to-date
recommendations in that department.

If you install GNU Radio, there's an application called "uhd_rx_cfile"
that takes advantage of the multi-threaded nature of GNU Radio, and does
better.


But keep *firmly in mind* that once you migrate from writing to /dev/null
to writing to real disk hardware, 5e7 samples/second is going to result in
a LOT of disk I/O, more than most ordinary single-disk, non-RAID disk
systems can usefully sustain.
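
To put a number on "a LOT of disk I/O", here is a minimal sketch of the
sustained write rate. It assumes the file holds raw complex samples with
no container overhead, stored as complex short, float or double (4, 8 or
16 bytes per sample); the helper name disk_write_rate_mb_s is
hypothetical.

```python
# Bytes per complex sample as stored on disk (assumed raw layout).
BYTES_PER_SAMPLE = {"short": 4, "float": 8, "double": 16}


def disk_write_rate_mb_s(sample_rate, sample_type):
    """Sustained write rate in MB/s (1 MB = 1e6 bytes)."""
    return sample_rate * BYTES_PER_SAMPLE[sample_type] / 1e6


for sample_type in ("short", "float"):
    rate = disk_write_rate_mb_s(5.6e7, sample_type)
    per_minute_gb = rate * 60 / 1e3
    print(f"{sample_type}: {rate:.0f} MB/s, {per_minute_gb:.2f} GB per minute")
```

The disk has to sustain this rate continuously, not just in bursts,
for a recording to complete without overflows.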




___
USRP-users mailing list
USRP-users@lists.ettus.com
http://lists.ettus.com/mailman/listinfo/usrp-users_lists.ettus.com