Re: [music-dsp] Supervised DSP architectures (vs. push/pull)

2016-08-02 Thread Andy Farnell
Dreaming about novel real-time DSP architectures... bottom up? 

I find this discussion and general problem of DSP architectures
suited to parallel computation exciting.
It's something I've pondered while considering a problem in
the implementation layer of procedural audio: 'level of audio
detail', the sound-design principle that not every sonic detail
needs computing perfectly all the time, and that good-enough
models can be 'computationally elastic'.

Indeed, in games, as Ethan F indicates in his post, material
is often wide rather than deep, with lots of contributory
signals, and some papers (search SIGGRAPH) have been written
on perceptual prioritisation in games.

Of course, a good solution is one that allows dynamic
reconfiguration of DSP graphs, but it also seems to need
all the trappings of operating-system principles: prioritisation,
critical-path/dependency solving, cache prediction, scheduling,
cost estimation, a-priori metrics, etc.

Although I kind of abandoned that line of thought, honestly
due to the lazy assumption that raw CPU capability would overtake
my ambitions, there are indeed certain sound models that are
really rather hard to express within traditional DSP frameworks.
An example is fragmentation, as a recursive (forking) 'particle
system' of ever-smaller pieces, each with decreasing level of
detail. I imagine this elegantly expressed in LISP and easily
executed on multiple processors. And I can see other
applications, perhaps for new audio effects that adapt to
the changing complexity of incoming material.
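
To make that concrete, here is a hypothetical sketch of such a
fragmentation model (in C++ rather than LISP; synthOneFragment is a
made-up stand-in for a real synthesis routine, not existing code):

    // Recursive fragmentation: each piece spawns smaller pieces at a
    // lower level of detail, until the detail budget runs out.
    void synthOneFragment(float size, int lod) {
        // stand-in for a real grain/impact model at this detail level
        (void)size; (void)lod;
    }

    void fragment(float size, int lod, int maxDepth) {
        synthOneFragment(size, lod);
        if (maxDepth == 0 || lod == 0) return;  // too small to matter
        const int forks = 2;                    // pieces per break
        for (int i = 0; i < forks; ++i)
            fragment(size / forks, lod - 1, maxDepth - 1);
    }

Each recursive call could become an independent task, which is what
makes the model such a natural fit for multiple processors.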

But the fear I felt when thinking about "supervision" is twofold:

1) We need reliable knowledge about DSP processes:
   i) Order of growth in time and space
  ii) Anomalies, discontinuities, instabilities
 iii) Significance (perhaps a perceptual model)
 
...and that knowledge might not be so reliable and consistent.
As Ross said, some of it is not easily computable, and many of the
issues in the Jack paper (Letz, Fober, Orlarey, Davis) just get
worse the more cores (and inter-core communication paths) you add.
Again, this gets worse when the material is interactive, as in games,
and where you may want to adapt the level of audio detail
on the fly.

2) That deep, synchronous audio systems are always 'brittle': one
thing fails and everything fails, and at some point the complexity and
explicit rules at the supervisor level just become too much to create
effects and instruments that are certain not to glitch during
performance.

It's as if 'real-time' and 'massively concurrent' don't mix well.

So I got to wondering whether _super_ vision is the wrong way of
looking at this for audio. Please humour a fool for a moment.
Instead of thinking like a 'kernel', what can we learn from Hurd
and Minix? What can we learn from networking and from massively
concurrent asynchronous systems that have failure built in as an
assumption?

1) DSP nodes that can advertise capability
2) Processes that can solicit work with constraints
3) Opportunistic routing through available resources
4) Time to live for low priority contributory signals
5) Soft limits, cybernetics (correction and negative feedback)
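
As a toy illustration of ideas 1-4 (entirely speculative; every name
here is made up, not an existing system):

    #include <string>
    #include <vector>

    // A node advertises what it can do and what it currently costs.
    struct NodeAdvert {
        std::string capability;    // e.g. "resonator", "grain-mixer"
        float       costEstimate;  // rough cycles per sample
        float       load;          // 0..1, for opportunistic routing
    };

    // A request solicits work under constraints, with a time-to-live
    // so low-priority contributory signals can expire quietly instead
    // of stalling the graph.
    struct WorkRequest {
        std::string capability;
        float       maxCost;
        int         ttlFrames;
    };

    // Opportunistic routing: pick the least-loaded capable node, or let
    // the request die if its TTL has run out ('good enough' output).
    const NodeAdvert* route(const WorkRequest& rq,
                            const std::vector<NodeAdvert>& nodes) {
        if (rq.ttlFrames <= 0) return nullptr;  // expired: degrade gracefully
        const NodeAdvert* best = nullptr;
        for (const NodeAdvert& n : nodes)
            if (n.capability == rq.capability && n.costEstimate <= rq.maxCost)
                if (!best || n.load < best->load)
                    best = &n;
        return best;                            // null = no capable node
    }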

So, if you were to think like 1960s DARPA and say "I want to
construct a DSP processor based on nodes where many could be
taken out by 'enemy action', and still get a 'good enough' signal
throughput and latency" - what would that look like?

Approached this way, what you get probably looks horribly inefficient
for small systems, where the inter-process bureaucracy dominates,
but it should be very scalable, doing better and better as the
complexity increases rather than worse.

cheers,
Andy

 


Re: [music-dsp] Supervised DSP architectures (vs. push/pull)

2016-08-01 Thread Evan Balster
Here's my current thinking.  Based on my current and foreseeable future
use-cases, I see just a few conditions that would play into automatic
prioritization:

   - (A) Does the DSP depend on a real-time input?
   - (B) Does the DSP factor into a real-time output?
   - (C) Does the DSP produce side-effects?  (e.g. observers, sends to
   the application thread)

Any chain of effects with exactly one input and one output could be grouped
into a single task with the same priority.  Junction points whose sole
input or sole output is such a chain could also be part of it.

This would yield a selection of DSP jobs which would be, by default,
prioritized thus:

   1. A+B+C
   2. A+B
   3. A+C
   4. B+C
   5. B
   6. C

Any DSPs which do not factor into real-time output or side-effects could
potentially be skipped (though it's worth considering that DSPs will
usually have state which we may want to keep updated).

It is possible that certain use-cases may favor quick completion of
real-time processing over latency of observer data.  In that case, the
following scheme could be used instead:

   1. A+B (and A+B+C)
   2. B+C
   3. B
   4. A+C
   5. C

(Where steps 4 and 5 may occur after the callback has been satisfied)
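
A minimal sketch of how those A/B/C flags might translate into job
ranks (a hypothetical encoding, assuming the default ordering above):

    #include <cstdint>

    enum JobFlags : uint8_t {
        REALTIME_INPUT  = 1 << 0,  // (A)
        REALTIME_OUTPUT = 1 << 1,  // (B)
        SIDE_EFFECTS    = 1 << 2,  // (C)
    };

    // Lower rank = higher priority, per the default ordering:
    // 1. A+B+C  2. A+B  3. A+C  4. B+C  5. B  6. C
    int defaultRank(uint8_t f) {
        switch (f) {
            case REALTIME_INPUT | REALTIME_OUTPUT | SIDE_EFFECTS: return 1;
            case REALTIME_INPUT | REALTIME_OUTPUT:                return 2;
            case REALTIME_INPUT | SIDE_EFFECTS:                   return 3;
            case REALTIME_OUTPUT | SIDE_EFFECTS:                  return 4;
            case REALTIME_OUTPUT:                                 return 5;
            case SIDE_EFFECTS:                                    return 6;
            default:                                              return 7; // skippable
        }
    }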

To make the prioritization more flexible, individual DSPs could be assigned
priority values above, below or between the automatic ones.  The priority
of a chain would be the priority of its most essential element, and chains
whose inputs have not yet been computed could be withheld from the priority
queue until such time as they are ready for processing.
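
And a sketch of that withholding mechanism (assuming a single
scheduler thread owns the queue; the names are hypothetical):

    #include <queue>
    #include <vector>

    struct Chain {
        int rank;                      // e.g. from defaultRank() above
        int unmetInputs;               // inputs not yet computed
        std::vector<Chain*> dependents;
    };

    struct ByRank {
        bool operator()(const Chain* a, const Chain* b) const {
            return a->rank > b->rank;  // lower rank = higher priority
        }
    };

    using ReadyQueue =
        std::priority_queue<Chain*, std::vector<Chain*>, ByRank>;

    // Only chains whose inputs are ready ever enter the queue; when a
    // chain finishes, its dependents may become ready and get pushed.
    void onChainFinished(Chain* c, ReadyQueue& ready) {
        for (Chain* d : c->dependents)
            if (--d->unmetInputs == 0)
                ready.push(d);
    }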

Note that I haven't given much consideration to scratch memory in the
scheme above.  The reason for this is that, as I'm realizing, the temporary
memory needs of even a complex DSP tree are small compared to permanent
memory such as samples, delay lines, et cetera.  As I've been learning
recently, cache-locality gains from memory re-use are typically most
relevant within small pieces of code.  If anyone has evidence to the
contrary, though, I'd love to see it.

– Evan Balster
creator of imitone 


Re: [music-dsp] Supervised DSP architectures (vs. push/pull)

2016-07-31 Thread Ethan Fenn
A few years ago I built a mixing engine for games. Some aspects of the
design sound similar to what you're thinking about.

Every audio frame (I think it was every 256 samples at 48k), the
single-threaded "supervisor" would wake up and scan the graph of audio
objects, figuring out what needed doing and what the dependencies were. As
its output it produced a vector of "job" structures, describing the DSP to
take place, where to pull the input and settings from, and where to write
the output. Then the pool of worker threads would wake up, pluck jobs one
by one from the front of the array, and get to work on them.

I did it this way partly because one of the target platforms was the PS3
with the Cell processor, and this was the preferred programming model for
using the SPEs, the auxiliary computation cores of the Cell. But it mapped
just fine onto other platforms.

I didn't have any notion of "priority" -- everything in the graph just
needed to get done every frame. My synchronization model was also pretty
crude. When there were any dependencies on previous steps, I would insert a
fence into the job vector, and workers reaching the fence would just wait
until all the previous jobs were complete before moving on. Not the most
flexible structure for sure, but I could get away with it because my mixing
graphs were wide rather than deep -- a lot of sound effects playing at
once, but only very few mix/effect buses. So all of the sound-effect
processing would get parallelized perfectly, which was really all I
needed.
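
For concreteness, a rough sketch of that job-vector-with-fences scheme
(illustrative C++, not the engine's actual code; the 'gate' field is my
own framing of the fence rule):

    #include <atomic>
    #include <thread>
    #include <vector>

    struct Job {
        bool   isFence = false;
        size_t gate    = 0;  // index just past the preceding fence
                             // (the supervisor fills this in)
        void (*run)(void*) = nullptr;
        void*  data        = nullptr;
    };

    std::vector<Job> jobs;            // rebuilt by the supervisor each frame
    std::atomic<size_t> nextJob{0};   // next index to claim
    std::atomic<size_t> completed{0}; // jobs (and fences) finished so far

    void workerLoop() {
        for (;;) {
            size_t i = nextJob.fetch_add(1);
            if (i >= jobs.size()) return;  // this frame's work is claimed
            // A fence waits for everything before it; a normal job only
            // waits for its preceding fence to have been passed.
            size_t gate = jobs[i].isFence ? i : jobs[i].gate;
            while (completed.load(std::memory_order_acquire) < gate)
                std::this_thread::yield();
            if (!jobs[i].isFence)
                jobs[i].run(jobs[i].data);
            completed.fetch_add(1, std::memory_order_release);
        }
    }

Crude, as described: workers that claim jobs past an unsatisfied fence
simply spin until it clears, which is fine when graphs are wide and
fences are few.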

This kind of model is too simple for lots of applications, and it may not
do what you need. But I'd strongly recommend thinking about what your
typical graphs are going to look like and about what the simplest possible
supervisor would be that would handle that kind of graph well. Writing a
perfect general scheduler is something you could easily spend 100% of your
time doing if you wanted to, and you probably don't!

-Ethan



Re: [music-dsp] Supervised DSP architectures (vs. push/pull)

2016-07-28 Thread Evan Balster
Haha, Ross, I'm not sure I'll be going *quite* so deep just yet.

My most pressing need is simply to access more processing power than one
callback will give me (without underflow).  To that end, I'll be setting up
a signaling system whereby one stream can have "helper threads" that are
notified when new input is available and do their best to keep output
available.  For the first implementation I'll permit the slave thread's
output to have higher latency...  Easy job.
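
A minimal sketch of that kind of signaling (assuming a condition
variable and flags; the names are hypothetical, and taking even a
brief lock in the callback is a known real-time hazard, shown here
only for simplicity):

    #include <condition_variable>
    #include <mutex>

    std::mutex              m;
    std::condition_variable cv;
    bool newInput = false;  // set when the stream receives input
    bool shutdown = false;

    // Stream side: cheap notification that new input is available.
    void onInputAvailable() {
        { std::lock_guard<std::mutex> lk(m); newInput = true; }
        cv.notify_one();
    }

    // Helper thread: wake on new input and render ahead into an
    // output FIFO so the callback can stay fed.
    void helperLoop() {
        std::unique_lock<std::mutex> lk(m);
        for (;;) {
            cv.wait(lk, []{ return newInput || shutdown; });
            if (shutdown) return;
            newInput = false;
            lk.unlock();
            // ... process available input; push rendered output ...
            lk.lock();
        }
    }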

From there, I'm interested in starting with a single-threaded supervisor
which can break the processing graph into chunks and run them according to
a simple prioritization scheme* while keeping the amount of scratch-memory
used within reasonable bounds.  In my current system, most DSPs just
operate on their output buffer, and a scratch-memory stack is made
available for any temporary allocations in the rendering tree...  That will
need to change, though.
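
(The scratch-memory stack mentioned above might look roughly like this
LIFO bump allocator; a sketch, not the actual implementation:)

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // LIFO scratch allocator: allocations are released in reverse
    // order, which suits a depth-first pull through a rendering tree.
    class ScratchStack {
        std::vector<float> pool;  // preallocated; no heap use per block
        size_t top = 0;
    public:
        explicit ScratchStack(size_t samples) : pool(samples) {}
        float* push(size_t n) {
            assert(top + n <= pool.size());
            float* p = pool.data() + top;
            top += n;
            return p;
        }
        void pop(size_t n) { assert(n <= top); top -= n; }
    };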

Later on, though, I may very well investigate a supervisor with multi-core
capabilities.  (I certainly want to get a better grip on multi-threaded
scheduling for purposes outside DSP.)  I've relied thus far on a small
number of handy lock-free abstractions** to synchronize state in my audio
framework, but for things like worker threads I want to get a grip on the
practicalities of using things like condition variables in a low-latency
DSP system.

– Evan Balster
creator of imitone 

* I expect a "simple prioritization scheme" to prioritize different parts
of the graph depending on whether their inputs and/or outputs lead to
real-time or non-real-time sources or sinks.  For instance, a microphone
level metric might be quite high, while a synthesizer that feeds into a
recorder (and not the speakers) would be very low.



Re: [music-dsp] Supervised DSP architectures (vs. push/pull)

2016-07-27 Thread Ross Bencina

Hi Evan,

Greetings from my little cave deep in the multi-core scheduling rabbit 
hole! If multi-core is part of the plan, you may find that multicore 
scheduling issues dominate the architecture. Here are a couple of 
starting points:


Letz, Stephane; Fober, Dominique; Orlarey, Yann; Davis, Paul:
"Jack Audio Server: MacOSX port and multi-processor version",
Proceedings of the first Sound and Music Computing conference (SMC'04),
pp. 177–183, 2004.
http://www.grame.fr/ressources/publications/SMC-2004-033.pdf

CppCon 2015: Pablo Halpern “Work Stealing"
https://www.youtube.com/watch?v=iLHNF7SgVN4

Re: prioritization. Whether the goal is lowest latency or highest 
throughput, the solutions come under the category of Job Shop Scheduling 
Problems. Large classes of multi-worker multi-job-cost scheduling 
problems are NP-complete. I don't know where your particular problem 
sits. The Work Stealing schedulers seem to be a popular procedure, but 
I'm not sure about optimal heuristics for selection of work when there 
are multiple possible tasks to select -- it's further complicated by 
imperfect information about task cost (maybe the tasks have 
unpredictable run time), inter-core communication costs etc.
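
For a flavor of the work-stealing idea, a toy, mutex-guarded sketch
(real schedulers use lock-free deques such as Chase-Lev; this is not
one):

    #include <deque>
    #include <functional>
    #include <mutex>
    #include <vector>

    using Task = std::function<void()>;

    // One deque per worker: the owner pushes and pops at the back
    // (LIFO, cache-friendly); thieves steal older work from the front.
    struct WorkerQueue {
        std::deque<Task> tasks;
        std::mutex       m;

        void push(Task t) {
            std::lock_guard<std::mutex> lk(m);
            tasks.push_back(std::move(t));
        }
        bool popBack(Task& t) {
            std::lock_guard<std::mutex> lk(m);
            if (tasks.empty()) return false;
            t = std::move(tasks.back()); tasks.pop_back(); return true;
        }
        bool stealFront(Task& t) {
            std::lock_guard<std::mutex> lk(m);
            if (tasks.empty()) return false;
            t = std::move(tasks.front()); tasks.pop_front(); return true;
        }
    };

    // A worker drains its own queue, then tries to steal from others.
    void runWorker(size_t self, std::vector<WorkerQueue>& queues) {
        Task t;
        for (;;) {
            if (queues[self].popBack(t)) { t(); continue; }
            bool stole = false;
            for (size_t i = 0; i < queues.size() && !stole; ++i)
                if (i != self && queues[i].stealFront(t)) { t(); stole = true; }
            if (!stole) return;  // idle; a real scheduler would park and retry
        }
    }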


Re: scratch storage allocation. For a single-core single-graph scenario
you can use graph coloring (same as a compiler register allocator). For
multi-core I guess you can do the same, but you might want to do
something more dynamic, e.g. reuse a scratch buffer that is likely in
the local CPU's cache.
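
To illustrate the single-core case: buffer lifetimes in a scheduled
graph form intervals, so greedy first-fit over execution order colors
them optimally (a sketch; jobs are assumed already ordered):

    #include <cstddef>
    #include <vector>

    // Each value is live from the step that produces it to the last
    // step that consumes it.  Reuse the first buffer that is free.
    struct LiveRange { size_t begin, end; };  // sorted by begin

    std::vector<size_t> assignBuffers(const std::vector<LiveRange>& ranges,
                                      size_t& buffersNeeded)
    {
        std::vector<size_t> assignment(ranges.size());
        std::vector<size_t> busyUntil;  // per buffer: end of current occupant
        for (size_t v = 0; v < ranges.size(); ++v) {
            size_t chosen = busyUntil.size();
            for (size_t b = 0; b < busyUntil.size(); ++b)
                if (busyUntil[b] <= ranges[v].begin) { chosen = b; break; }
            if (chosen == busyUntil.size()) busyUntil.push_back(0);
            busyUntil[chosen] = ranges[v].end;
            assignment[v] = chosen;
        }
        buffersNeeded = busyUntil.size();
        return assignment;
    }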


Cheers,

Ross.




[music-dsp] Supervised DSP architectures (vs. push/pull)

2016-07-27 Thread Evan Balster
Hello ---

Some months ago on this list, Ross Bencina remarked about three prevailing
"structures" for DSP systems:  Push, pull and *supervised architectures*.
This got some wheels turning, and lately I've been confronted by the need
to squeeze more performance by adding multi-core support to my audio
framework.

I'm looking for wisdom or reference material on how to implement a
supervised DSP architecture.

While I have a fairly solid idea as to how I might go about it, there are a
few functions (such as prioritization and scratch-space management) which I
think are going to require some additional thought.


*Background on my use case*:  (optional reading)

I have a mature audio processing framework that I use in a number of
applications.  (I may open-source it in the future.)

   - *imitone*, which performs low-latency, CPU-intensive audio analysis.
   I plan on adding support for signaled multi-core processing for users who
   want to process many voices from a single device.

   - *SoundSelf*, a VR application which places excessive demands on audio
   rendering, with close to a thousand DSPs routinely operating in the output
   stream.  (The sound designer has free rein and uses it!)  The application
   also uses one or more input streams which may be ring-buffered to output.
   I plan on adding worker threads for less latency-sensitive audio.

My current architecture is pull-based:  The DSP graph assumes a tree
structure, where each unit encapsulates both DSP and state-synchronization,
and "owns" its inputs, propagating sync events and pulling audio as needed.

This scheme has numerous inelegances and limitations.  To name a few:
 Analysis DSP requires that audio be routed through to a sink, a "splitter"
mechanism must be used to share a source between multiple consumers, and
all units' audio formats must be fixed throughout their lifetimes.  It is
not directly possible to insert DSPs into a chain at run-time, or to
migrate the graph to a different stream.
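
For reference, the pull model being described is roughly this shape
(a skeletal sketch; the interface names are invented, not the
framework's actual API):

    #include <cstddef>
    #include <memory>
    #include <vector>

    // Each unit owns its inputs and recursively pulls audio from them
    // on demand.  Because ownership is exclusive, sharing a source
    // between consumers requires a 'splitter' node.
    class PullNode {
    public:
        virtual ~PullNode() {}

        // Render 'frames' samples: pull every input, then process.
        void pull(float* out, size_t frames) {
            inputBuffers.resize(inputs.size());
            for (size_t i = 0; i < inputs.size(); ++i) {
                inputBuffers[i].assign(frames, 0.0f);
                inputs[i]->pull(inputBuffers[i].data(), frames);
            }
            process(out, frames);  // sync events would propagate here too
        }

    protected:
        virtual void process(float* out, size_t frames) = 0;

        std::vector<std::unique_ptr<PullNode>> inputs;  // owned children
        std::vector<std::vector<float>> inputBuffers;   // heap use for
                            // brevity; a real version uses scratch memory
    };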

Thinking about a supervised architecture, I can see how I might be able to
solve all these problems and more.  By building a rendering and state
manager, I can reduce the implementation burden involved in building new
DSPs, prioritize time-sensitive processing, and skip unnecessary work.
Lastly, it could help me to build much more elegant and robust multi-stream
and multi-core processing mechanisms.

I would be very interested to hear from others who have used this type of
architecture:  Strengths, weaknesses, gotchas, et cetera.

– Evan Balster
creator of imitone 
___
dupswapdrop: music-dsp mailing list
music-dsp@music.columbia.edu
https://lists.columbia.edu/mailman/listinfo/music-dsp