Re: [Discuss-gnuradio] Debugging the Source of Dropped Samples

2018-02-14 Thread CEL
Hi Gilad!

Thank you for the feedback; happy to hear you've got that sorted out by
watching the buffer fill levels.
Yeah, we need to get smarter about how to understand and prevent such
situations.

Kernelshark not being the most intuitive tool here is understandable.
I'll have to look into how to build tools that help us!

Best regards,
Marcus




Re: [Discuss-gnuradio] Debugging the Source of Dropped Samples

2018-02-14 Thread Gilad Beeri (ApolloShield)
Hi Marcus,
Thanks for the detailed answer.
After a long while, I finally managed to work on it (in several steps)
during the last couple of months.

I managed to find the bottleneck using ControlPort. It showed a couple of
blocks that were bottlenecks (based on the previous blocks' output
buffers). Interestingly, they weren't IO-bound. I haven't investigated
further, but the facts that (i) they caused dropped samples, (ii) were
CPU-bound, and (iii) never came anywhere near 100% in htop suggest that
their CPU usage fluctuates strongly: they may well use 100% of a core for
a very short time (and then drop samples), shorter than the sampling and
averaging periods of tools like htop. Optimizing those blocks solved the
dropped samples.
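In case it helps someone later: since (with the default scheduler) every
block runs in its own thread, a per-thread view can expose a single
saturated block thread that process-level averages hide, e.g. with
pidstat from sysstat (my_flowgraph.py stands in for whatever your top
block script is called):

    pidstat -t -u 1 -p $(pgrep -f my_flowgraph.py)

One-second sampling can of course still miss sub-second bursts, but it
narrows things down.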

I only gave kernelshark a little attention. It didn't prove to be a
simple tool to use out of the box; I don't think the docs and the level
of polish are good enough for this general use case. Maybe that changes
if someone takes it on as a project and documents how to apply it to
flowgraph inspection.





Re: [Discuss-gnuradio] Debugging the Source of Dropped Samples

2017-11-07 Thread Marcus Müller
Hi Gilad,

part of this is for the future reader of this thread, so, please, bear
with me:

On 07.11.2017 10:42, Gilad Beeri (ApolloShield) wrote:
> I have a flowgraph that, when run, never brings any CPU core close to
> 100% utilization.

Indeed, dropped samples indicate a bottleneck narrower than your USRP's
sampling rate, but that bottleneck doesn't have to be CPU overutilization!
Simplest example: take a flow graph that otherwise wouldn't cause any
problems and add a Throttle block set to half the necessary sampling
rate – no core breaks a sweat, yet samples get dropped.
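To make that concrete for said future reader, here's a minimal sketch of
that failure mode, assuming gr-uhd and GNU Radio's Python API (written
from memory and untested, so treat it as an illustration, not a recipe):

    from gnuradio import gr, blocks, uhd

    samp_rate = 4e6
    tb = gr.top_block()
    src = uhd.usrp_source("",  # empty address: first USRP found
                          uhd.stream_args(cpu_format="fc32", channels=[0]))
    src.set_samp_rate(samp_rate)
    # The throttle deliberately consumes at only half the rate: it, not
    # the CPU, is the bottleneck, and UHD prints "O" (overflow) although
    # no core is anywhere near 100%.
    thr = blocks.throttle(gr.sizeof_gr_complex, samp_rate / 2)
    tb.connect(src, thr, blocks.null_sink(gr.sizeof_gr_complex))
    tb.run()  # runs until interrupted; watch for the "O"s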

Most often, I find that IO operations are actually the bottleneck – be
it that sending samples to the USRP (or receiving them) is actually
pretty time-intensive, or that you need to interact with storage.

Depending on the tooling you choose, this fact might or might not be
hidden; time spent, for example, "on behalf" of a thread in kernel land
– searching for a contiguous piece of memory to hand to that process,
handling USB buffers, and so on – might or might not be attributed to
the process.

Another very classical problem is memory bandwidth and latency; as
shown by SE at this year's GRCon, chances aren't bad that you can
optimize quite a bit by co-locating connected blocks on the same CPU
core: you get a caching advantage (or rather, don't incur a
disadvantage).
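You can experiment with that directly: every block has a
set_processor_affinity() method taking a list of allowed cores. A toy
sketch (placeholder taps, and core numbers are machine-specific):

    from gnuradio import gr, blocks, filter

    tb = gr.top_block()
    src = blocks.null_source(gr.sizeof_gr_complex)
    fir = filter.fir_filter_ccf(1, [0.1] * 16)  # placeholder taps
    mag = blocks.complex_to_mag_squared()
    tb.connect(src, fir, mag, blocks.null_sink(gr.sizeof_float))
    # Pin the two connected, compute-heavy blocks onto the same core,
    # so the samples handed from fir to mag stay cache-warm:
    fir.set_processor_affinity([2])
    mag.set_processor_affinity([2])
    tb.run()  # runs until interrupted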

That all being said, how do you proceed?

First of all, this is one of the cases where having ControlPort is very
helpful. If you have it (with Thrift and PerfCounters enabled), you can
start the CtrlPort Performance Monitor and see which output buffers
"stay full" all the time. The block after such a buffer is probably your
bottleneck.

If you don't, try running `perf top -ag` (running as root might help
here, since you also want to inspect time spent in the kernel – not
quite sure about that, though). You should get a listing along the lines
of "when we sampled where the CPU(s) were, they were stuck in these
functions x % of the time".
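If you'd rather record and inspect afterwards than watch live, something
along these lines should work:

    # sample all CPUs with call graphs for ten seconds, then browse:
    sudo perf record -ag -- sleep 10
    sudo perf report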

I really wanted to, but haven't had the time to work with kernelshark
yet. It might really be the tool of choice here. In fact, it looks so
cool that I could imagine us one day superseding the perf counter
concept with it; who knows.
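For reference, I'd expect the workflow to be roughly this – trace the
scheduler while the flow graph runs, then browse the result in the GUI
(my_flowgraph.py again stands in for your top block script):

    sudo trace-cmd record -e sched_switch -e sched_wakeup \
        python my_flowgraph.py
    kernelshark trace.dat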

If you do happen to look into that, I'd be very happy to get some
feedback about the process and what the problems were. I think this is
definitely something we want to enable users to do – to understand not
only the behaviour of their blocks in isolation, but how the system
works as a whole. After all, one of the major "let's dream about the GNU
Radio of the future" things we're considering is making it easy to
distribute a flow graph across computers, and for that, systemic insight
is pretty much a must.

Best regards,
Marcus



___
Discuss-gnuradio mailing list
Discuss-gnuradio@gnu.org
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio