On 2022-02-28, Fons Adriaensen wrote:
> General purpose libraries doing efficient zero delay convolution
> using multiple partition sizes (as suggested by Gardner) do exist.
> But they are complete overkill for the simple problem at hand
> (decoding UHJ).
In a sense they are overkill, but in another they are not. Because *if*
you can put an easy API on a zero-delay convolution library, which also
admits full parallelization and, as the Gardner framework does,
exhibits constant and asymptotically optimal computational load, the
thing is a no-worries plugin everybody can and will use. If you *don't*
have guarantees like that, developers won't adopt it: for fear of
complexity, of varying load which freezes your customer's phone up, of
extra latency which just doesn't work too well when the library is
embedded in a wider feedback loop (binaural following in gaming comes
to mind), or of just sheer average load.
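To make the Gardner point concrete, here is a minimal Python sketch of
the partition schedule such a library runs on. The function name, base
size and doubling policy are my own illustration, not any particular
library's API: the first partitions stay at the base block size, so
latency never exceeds one block, while the sizes double so the number
of distinct FFT sizes, and hence the per-sample cost, grows only
logarithmically in the impulse response length.

```python
def gardner_partitions(ir_length, base=64, repeats=2):
    """Cover an impulse response of ir_length samples with partitions
    whose sizes double every `repeats` blocks (a Gardner-style
    schedule). The first partition runs at the base block size, so the
    scheme adds no latency beyond the base block."""
    sizes = []
    covered = 0
    size = base
    while covered < ir_length:
        for _ in range(repeats):
            if covered >= ir_length:
                break
            sizes.append(size)
            covered += size
        size *= 2
    return sizes

# A 16384-tap IR splits into a short doubling schedule; the early,
# small partitions give the zero latency, the late, big ones the
# efficiency.
print(gardner_partitions(16384))
```

The schedule is what buys the constant, bounded load: each doubling
tier contributes a fixed amount of work per output sample.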
> Even the most basic FFT-based implementation wouldn't take more than
> one percent CPU time on a ten year old PC or something like a
> Raspberry Pi.
Indeed, and if you run such algorithms on a DSP, with say modulo
addressing and a proper streaming memory access pipeline, the hit can be
even less...
...unless you're trying something new, like fully general time-variant
HOA reverberation. The computational load, especially if you have to
route it through a plug-in zero delay library, quickly multiplies. For
instance, if you do full periphonic third order reverberation, you have
n=(m+1)^2 signals to contend with, which already at m=3 leaves you with
16 independent input and output signals, and as such 16^2=256
cross-connections, convolutions, between the components.
This is no longer a trivial load for any current processor, or even a
processor array. Not least because of the memory bandwidth and cache
contention concerns.
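The blow-up above is just this bit of arithmetic; the helper names are
mine, for illustration:

```python
def ambisonic_channels(order, periphonic=True):
    """Number of independent signals in a full-sphere (periphonic)
    ambisonic set of order m: n = (m + 1)^2."""
    if periphonic:
        return (order + 1) ** 2
    # horizontal-only sets grow only linearly: n = 2m + 1
    return 2 * order + 1

def cross_convolutions(order):
    """Convolutions needed for a fully general reverberation matrix
    coupling every input component to every output component."""
    n = ambisonic_channels(order)
    return n * n

print(ambisonic_channels(3), cross_convolutions(3))  # prints: 16 256
```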
Of course sane engineering would call for the dimensionality of the
problem to be reduced. Say, do a multidimensional KLT or at least a DCT
before you process your signal set, and then cut off the extra
dimensions which cease to contribute significantly to the final
processing. As you know, the discarded coefficients can be proven to
fall off fast rather generally in this setup, under most musically
relevant steady-state conditions.
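A toy demonstration of that energy-compaction claim, with a naive
stdlib-only DCT-II; the smooth bump standing in for a "musically
relevant" signal profile is an arbitrary choice of mine:

```python
import math

def dct2(x):
    """Naive O(n^2) DCT-II, enough to illustrate energy compaction."""
    n = len(x)
    return [sum(x[k] * math.cos(math.pi * (k + 0.5) * j / n)
                for k in range(n))
            for j in range(n)]

# A smooth profile: its energy should pile into the first few bins,
# so the trailing dimensions can be cut off with little loss.
n = 64
x = [math.exp(-((k - 20) / 12.0) ** 2) for k in range(n)]
c = dct2(x)
energy = sum(v * v for v in c)
leading = sum(v * v for v in c[:8])
print(round(leading / energy, 4))  # well above 0.95
```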
Except they can *not* be proven to do so when the system function
derives from geometrical-acoustical, time-variant causes, such as is
the case in first-person-shooter games. Think about when you, as the
shooter, come from a narrow passageway into a broad arena.
The game's designer would typically have to model those spaces as
separate, and somehow interpolate between the solutions, because a
global solution to the acoustic field equation is *very* much beyond the
capability of *any* hardware out there. But then if you're doing third
order ambisonics, even the transitory rendered field is rather hard to
calculate. It's *especially* hard to calculate if you don't have a
zero-delay, no-nonsense, constant-load-and-latency library at your
disposal, one which does *not* require extra calls into some weird
principal component machinery in order to work. (And you know, over
multidimensional signal sets, that machinery has *never* been put into
a constant or even bounded-effort framework the way Gardner did for
zero delay convolution; your game now freezes up because the audio guy
decided to optimize something in the long run, and you lose the
tournament.)
> Some people think they can build a perpetuum mobile. Or extract
> information out of nothing.
If you read my comments, I *never* do that. FFS, I'm a libertarian, much
of the Thatcher/Reagan sort: "There are no free lunches."
> For each finite-order B-format, there are wildly different source
> distributions that will produce exactly the same B-format signals.
Indeed. But the vast majority of them are physically implausible. For
instance, it is implausible that there would be more than tens of
thousands of point sources, or more than ten extended sources, within
any acoustical image captured by an ambisonic mic, or indeed
synthetically rendered into any feed. Certainly they won't all be
active at the same time and with precisely the same frequency
distribution.
Which means that we can legitimately assume something about our signal
set which otherwise needn't be true. We can make a priori assumptions,
and then go from them to decoding solutions which otherwise would not be
legitimate. We can do what in other contexts is called
"superresolution". It's not "something from nothing", but rather
"something for something".
And we've been doing this the whole time ambisonics has existed,
because the classical first order decoding solution, which led us to
the shelf filters, is already a superresolution one: it assumes a
single directional source, and two different metrics, from Makita
theory. Higher up, the power criterion; lower down, the linear
interference theory. But all of that is in the context of a single
source; it doesn't consider multiple, arbitrary, possibly somewhat
coherent sources. Why? Because it can*not*: that immediately leads to
an overcomplete problem, where there is no possibility of enhancement
beyond just playing back the signal as it is, via a straight
pseudoinverse.
Within that assumption, then, Harpex does a bit more: it takes into
consideration what would happen if we had two specular sources, and
inverts for that. DirAC takes a different approach: it's unprincipled in
its handling of specular sources, but also rather nice in that it
separates the time-quadrature, off-phase, resonant field from any
straight propagating components. (Here the discussion Angelo Farina
instigated years back is highly relevant: sound intensity theory
strongly suggests that standing waves, i.e. the imaginary part of the
Fourier transform of the SoundField signal set, have to do with
reverberation, while the real part is about incident, propagating
energy from a source. The later analysis by the HOA crowd seems to
concur, and to be pretty much complete, after they also added spherical
Hankel functions in order to deal with emanations outwards.)
> It's quite easy to generate signals that will produce a completely
> wrong output from Harpex or similar systems.
That's a linear inverse problem, and if I know anything about
mathematics, it's linear algebra. So yes, I could easily construct those
very counter-examples.
Not under the assumption of a specular source, though. Nor under the
assumption of an arbitrary source, directionally averaged under a
Makita metric.
> So any parametric (assuming that is what you call 'active') decoder
> will have to start with some assumptions which may be wrong and will
> be wrong at least part of the time.
Precisely correct. Active matrix is the same as parametric matrix.
> Also the distinction plane wave vs. diffuse is a bit too simple.
> Consider a choir (say 30 singers) distributed over a 90 degree
> horizontal arc, singing unisono. If you want to decode this as 30
> point sources you'll need very high order. But it's not diffuse
> either. It's something in between, and a good parametric decoder
> should be able to detect such source distributions.
Yes, but if by unisono you actually mean "fully in phase", or what on
the radio technology side would translate as "coherent", well, then
taken at a point at the center of the arc, it's easily proven that the
W component is just the sum of the contributions of the various
sources. Any higher component is not, because of symmetry
considerations. But they fall off exponentially by order, even here.
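That choir case can be checked numerically. A sketch, using the plain
unweighted first-order encoding convention (no 1/sqrt(2) on W; my
simplification): W comes out as the coherent sum of all thirty
sources, the antisymmetric Y component cancels exactly at the arc's
center, and X already falls below W.

```python
import math

# 30 coherent unit sources spread over a 90-degree arc centred on the
# x axis; encode them into first-order components at the arc's centre.
n = 30
angles = [math.radians(-45 + 90 * k / (n - 1)) for k in range(n)]
W = sum(1.0 for _ in angles)          # pressure: coherent sum of all
X = sum(math.cos(a) for a in angles)  # direction-weighted sums pick
Y = sum(math.sin(a) for a in angles)  # up less than the full 30

print(round(W, 3), round(X, 3), round(Y, 9))
```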
And of course you have to consider wavelength. Such an array singing
"in unison" leads to different kinds of beaming depending on frequency.
If it emits acoustical radiation in the 1Hz regime, and it's about
1-10m away, obviously constructive interference will dominate the near
field of the observer. If, on the other hand, it emits ultrasonics in
the 100kHz range, *and* the emitters are widely spaced, 100m away, the
nearfield of the observer will be much more complicated.
Nota bene, we've talked about this dependency on rig diameter on-list
before. Under the topic of Martin Leese's "Big Geese". How when you go
into the party ground with your four big ambisonic speakers, and
suddenly the "geese sounds" just don't sound the same, even if angularly
speaking it *ought to* be the same as in your living room. What gives?
What gives is the spatial, and as such directional, aliasing limit. If
you had at your disposal a fully continuous-in-space rig, it wouldn't
matter which bandwidths you're talking about. But since you're
reconstructing a simulacrum of the field using discrete sources, it
suddenly *does* matter: they interfere with each other by their actual,
physical location, measured in the speed of sound. If you only have
four, it *does* matter, over an extended area around the so called
"sweet spot" at the center, how far away those speakers really lie.
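Back-of-envelope on the "Big Geese" effect; the c/(2*spacing) rule of
thumb and the rig sizes below are my own assumptions, just to show the
trend that the bigger rig aliases at a much lower frequency:

```python
import math

def alias_frequency(rig_radius_m, n_speakers, c=343.0):
    """Rough spatial-aliasing limit for a horizontal ring of speakers:
    above about c / (2 * spacing) the discrete sources no longer fake
    a continuous wavefront, and interference artefacts take over."""
    spacing = 2 * math.pi * rig_radius_m / n_speakers
    return c / (2 * spacing)

# Same four-speaker layout, living room vs. party ground:
print(round(alias_frequency(1.5, 4)))   # ~1.5 m living-room rig
print(round(alias_frequency(10.0, 4)))  # big festival rig
```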
(That theory is fully if rather opaquely expressed in Daniel, Nicol and
Moreau's NFC-HOA papers; that's why they tell you there *has* to be a
specific decoding radius as metadata; symmetry doesn't cut it as an
argument. The extra niceness is that they deal with emanating waves as
well, using Hankel functions. Was it Bruce or who, now, who also dealt
with translated origins of these decompositions, in the so called
"O-format", and its processing -- I seem to remember these reduce to
Clebsch-Gordan decompositions, from quantum mechanics. So that if you
follow the math, the ambisonic theory ought to be pretty much complete,
if not fully implemented, because of its final complexity.)
> In practice such things become possible if you start with something
> like third or higher order. But the maths are by no means trivial.
I'd in fact call the math pretty much trivial. It's mostly just about
LTI filtering, convolution. Its *optimized* form might not be quite as
nice, but the basic framework is just same-ol'.
That's of course because Gerzon and the lot chose to do the framework
so that all kinds of quadrature and such apply. This goes to the age
old adage in math: knowing *how* to frame a problem goes most of the
way to solving it.
> And if you have third or higher order, a correct linear decode will
> provide very good results and there is little reason to go
> parametric.
DirAC shows otherwise. Most people haven't heard what it can do, and yet
thanks to Ville Pulkki, I have. It's amazing: even with a sparse
reconstruction array, you suddenly "don't hear the speakers".
It's a different kind of active decoding, and its treatment of specular
sources is far from principled (i.e. could be bettered to be sure, and
as the ambisonic kind of guy that I am, I find its utilization of Ville
Pulkki's VBAP to be highly suspect), but in its treatment of the
reactive, reverberant field...OMFG. It sounds *natural*, even with very
limited and anisotropic speaker resources. Truly! (I also believe they
have been developing the idea further at Aalto University; haven't
caught up with the newest, yet.)
I believe the principled way to go about this would be to treat the
field as complex, and harmonic, then to square it in order to find
point sources, express that instantaneous solution as a higher order
complex spherical harmonical expansion, extract the out-of-phase
component for DirAC-like processing, and to apply some time-running
polynomial of the adjugate of the system function to set a variable
time-frequency tradeoff.
> Does anyone have a clue what this is supposed to mean?
Prolly just me, so let me explain a bit more. My basic ideas run as
follows.
First, every dynamic decoder matrix out there runs on some variation of
the same idea: you square the signal in order to find its power, then
you apply some filter over time in order to find sustained peaks
indicating a point/specular source, and after that you sharpen the image
by applying a matrix operation which attenuates anything perpendicular
to the detected signal. (Or, in the case of Harpex, two signals at the
same time. From periphonic first order B-format up to three could be
detected without too much confusion, but nobody has done so yet.)
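The square-filter-steer loop above can be sketched as follows.
`steer_estimate`, its smoothing constant, and the dictionary of
accumulators are all hypothetical names of mine, covering the generic
front end, not any particular product: square for power, low-pass the
short-time intensity estimates, read off a dominant direction and a
focus measure.

```python
import math

def steer_estimate(W, X, Y, smooth=0.9, state=None):
    """One step of a generic active-decoder front end: square the
    signals for power, low-pass the short-time estimates, and read off
    a dominant azimuth plus a focus (versus diffuseness) measure.
    `state` carries the smoothed accumulators between calls."""
    if state is None:
        state = {"p": 0.0, "ix": 0.0, "iy": 0.0}
    state["p"] = smooth * state["p"] + (1 - smooth) * (W * W)
    state["ix"] = smooth * state["ix"] + (1 - smooth) * (W * X)
    state["iy"] = smooth * state["iy"] + (1 - smooth) * (W * Y)
    azimuth = math.atan2(state["iy"], state["ix"])
    focus = math.hypot(state["ix"], state["iy"]) / (state["p"] + 1e-12)
    return azimuth, focus, state

# A steady source encoded at 30 degrees is detected after a few frames.
az = math.radians(30)
state = None
for _ in range(200):
    s = 1.0  # constant test signal
    azimuth, focus, state = steer_estimate(
        s, s * math.cos(az), s * math.sin(az), state=state)
print(round(math.degrees(azimuth), 2))  # prints: 30.0
```

A real decoder would of course run this per frequency band and feed
the estimate into the sharpening matrix, not print it.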
This reasoning is often obscured by how various active matrices are in
practice implemented. For example, Dolby Surround/Pro Logic/MP/Stereo
does its steering thingy originally by "feeding opposite polarity
signals via an active matrix, in order to reduce crosstalk between the
L/R and C/B pairs". Yet in order to do so without affecting overall
sound intensity or introducing too much skewing in the overall
soundfield, it necessarily reduces to a nonlinear controller feeding an
orthogonal/unitary linear matrix.
(No other theory of quadraphonic active steering goes into the idea of
unitarity, except the ambisonic one; Gerzon mentions it, since his
theory necessarily requires it, but it's a one-off which isn't developed
further. That's pretty bad, because Dolby's implementation could be
made better if it were more widely known that, as in UHJ decoding,
instead of just real-summing channels in order to get a 180 degree
quadrature in the back surround channel, you could get a much nicer
overall solution, with phasing pushed into the back, and continuous
over the whole encoding envelope.)
So why not do this in a principled fashion? Since you want to do a
second order operation (squaring, to get the power), start with an
ambisonic/spherical harmonical decomposition, and square it. You can do
that in closed form because you're talking about bandlimited discrete
functions in time, and harmonical polynomials in space. What you'll end
up with is two times oversampling in time, and twice the order of your
original spherical harmonical decomposition. Quite the price, but it
will be *exact*. No approximation, anywhere, and it will be capable of
representing *any* composition of sources, fully isotropically, and
without aliasing in time, just as the original signal set did.
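The doubling claim is ordinary trigonometry: squaring a tone at one
frequency produces DC plus a tone at twice the frequency, which is why
the exact representation needs twice the sample rate in time and twice
the order in space. A stdlib check with a naive DFT (the bin numbers
are my example):

```python
import cmath, math

def dft_mag(x):
    """Naive O(n^2) DFT magnitude spectrum, normalized by length."""
    n = len(x)
    return [abs(sum(x[k] * cmath.exp(-2j * math.pi * j * k / n)
                    for k in range(n))) / n for j in range(n)]

n = 64
f = 5
tone = [math.cos(2 * math.pi * f * k / n) for k in range(n)]
power = [v * v for v in tone]  # the squaring step of the decoder

# cos^2 = 1/2 + (1/2)cos(2x): energy lands at DC and at twice the
# original frequency (bin 10, and its mirror bin 54).
peaks = [j for j, m in enumerate(dft_mag(power)) if m > 0.1]
print(peaks)  # prints: [0, 10, 54]
```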
From there you can impose a principled focusing operation: if the power
signal is focused, after you filter the norm via whatever auditory
sensitivity function you choose, you can steer the signal by applying a
matrix operation to it which keeps the averaged norm toward the
direction of the detected signal the same, while attenuating and
redistributing the power in other directions. In effect keeping all of
the power constant, but sharpening in a certain direction. This is what
all active decoders do, only in my framework it can be done exactly,
against any order of ambisonics, modulo only numerical precision. It of
course in general goes to adjugates, trace theory and such, but it's
just linear algebra; an easy piece compared to the psychoacoustics
which would be necessary in order to construct the optimal open loop
controller for something like this.
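A toy version of that power-conserving sharpening, over a discretely
sampled set of directions. The cosine-shaped boost is an arbitrary
choice of mine; the point is only the exact renormalization, which
keeps the total power constant while pushing it toward the detected
direction.

```python
import math

def sharpen(power, target_idx, amount):
    """Redistribute a sampled directional power distribution toward
    the detected direction while keeping the total power constant:
    boost near the target, then rescale everything to the old total."""
    total = sum(power)
    n = len(power)
    boosted = [p * (1 + amount *
                    math.cos(2 * math.pi * (i - target_idx) / n))
               for i, p in enumerate(power)]
    scale = total / sum(boosted)
    return [b * scale for b in boosted]

# A diffuse distribution over 8 directions, sharpened toward index 2:
power = [1.0] * 8
out = sharpen(power, 2, 0.5)
print(round(sum(out), 6), round(out[2], 3))  # prints: 8.0 1.5
```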
>> Actually you *do* need Z. That's the point where I alluded to
>> Christoph Faller above: if you cut out the third dimension, your
>> reconstructed field will show a 1/r extra attenuation term from the
>> rig inwards, because you're bleeding off energy to the third
>> dimension.
> You need the third dimension for realism. Not for correct decoding.
In fact you do. At least for extended area reconstruction.
Think about it. Suppose you have a far source which we can think of as
exciting a plane wave at the origin. If you detect it and/or
reconstruct it using point sources over a circle, you'll miss the fact
that it has a component in the third direction. Even if it's just that
simplest plane wave, the sensing and the reconstruction will 1) miss
that 1/r attenuation Faller pointed out to me at the time, 1a) which
has had to be compensated for in the planar WFS literature already, 2)
there's hell to pay in sparse and not-too-regular/quadrature/point-like
arrays because of how Huygens's principle works with them, yielding
secondary point radiators which interfere (unlike the continuous arrays
of the theory, especially in full 3D), and 3) the most interesting
thingy: it's *not* theoretically sufficient even to have a full 3D rig
in order to solve the propagating wave equation.
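The 1/r point in 1) can be put in numbers. Reading the extra
attenuation as an intensity (energy) ratio, which is my assumption
here: spherical spreading loses intensity as 1/r^2, while cylindrical
spreading, energy confined to the horizontal plane, loses it only as
1/r, so a real 3D field trails a 2D reconstruction by exactly the
extra 1/r.

```python
def spherical_intensity(r):
    """Point source radiating into full 3D space: intensity ~ 1/r^2."""
    return 1.0 / (r * r)

def cylindrical_intensity(r):
    """Same energy confined to a plane (horizontal-only rig):
    intensity ~ 1/r."""
    return 1.0 / r

# Compare the rolloff between 1 m and 4 m from the rig inwards:
r1, r2 = 1.0, 4.0
extra = ((spherical_intensity(r2) / spherical_intensity(r1)) /
         (cylindrical_intensity(r2) / cylindrical_intensity(r1)))
print(extra)  # prints: 0.25, i.e. the extra 1/r with r = 4
```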
What you actually need in order to solve the system is, as part of
your continuous rig, pointwise control of not just the pressure field
but its normal derivative. That for a purely inwardly propagating
field from afar. If you also consider outwards propagation, you actually
need a "speaker" capable of doing everything a SoundField mic does at
the center of the array, at every point in 3D space around some closed
surface around the center point. Nothing else will do, if you want your
solution to the wave equation to converge over the whole area, even
given outwardly radiative, near-field solutions to the wave/acoustic
equation.
Which then means that, even in the reduced analysis, you must
explicitly say *why* you only analyse the sound pressure field, and not
the vectorial velocity field too. You might not *want* to, but it's there,
and it's hugely relevant. Especially in regards to resonances, standing
waves, convergence of solutions of the acoustical equation to the very
edge of any given bounded, convex set, outwardly bound energy (even
given the Sommerfeld radiation condition), and the lot.
--
Sampo Syreeni, aka decoy - de...@iki.fi, http://decoy.iki.fi/front
+358-40-3751464, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2