On 2022-02-28, Fons Adriaensen wrote:

General purpose libraries doing efficient zero delay convolution using multiple partition sizes (as suggested by Gardner) do exist. But they are complete overkill for the simple problem at hand (decoding UHJ).

In one sense they are overkill, but in another they are not. Because *if* you can put an easy API on a zero-delay convolution library which also admits full parallelization and, as the Gardner framework does, exhibits constant and asymptotically optimal computational load, the thing becomes a no-worries plugin everybody can and will use. If you *don't* have guarantees like that, developers won't adopt it: for fear of complexity, of varying load which freezes your customer's phone up, of extra latency which just doesn't work too well when the library is embedded in a wider feedback loop (binaural head-tracking comes to mind, in gaming), or of sheer average load.

Even the most basic FFT-based implementation wouldn't take more than one percent CPU time on a ten year old PC or something like a Raspberry Pi.

Indeed, and if you run such algorithms on a DSP, with say modulo addressing and a proper streaming memory access pipeline, the hit can be even less...

...unless you're trying something new, like fully general time-variant HOA reverberation. The computational load, especially if it has to go through a plug-in zero-delay library, quickly multiplies. For instance, if you do full periphonic third order reverberation, you have n = (m+1)^2 signals to contend with, which already at m = 3 leaves you with 16 independent input and output signals, and as such 16^2 = 256 cross-connections -- convolutions -- between the components.
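
Just to put numbers on that, a back-of-the-envelope count per order, as a little Python snippet (nothing here beyond the (m+1)^2 rule above):

    # Periphonic (full-sphere) ambisonic channel counts, and the number of
    # pairwise convolutions in a fully dense reverberation matrix.
    for m in range(1, 6):
        n = (m + 1) ** 2       # independent signals at order m
        print(m, n, n * n)     # order, channels, cross-convolutions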

This is no longer a trivial load for any current processor, or even a processor array, not least because of memory bandwidth and cache contention concerns.
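
And to be concrete about what even the "basic FFT-based implementation" means here, a minimal uniformly partitioned overlap-add convolver is a few lines of throwaway numpy (the block size is an arbitrary placeholder). Note that it carries one block of latency per partition, which is precisely what the Gardner-style schemes trade away at constant load:

    import numpy as np

    def ola_convolve(x, h, block=1024):
        """Overlap-add FFT convolution of signal x with impulse response h."""
        n_fft = 1
        while n_fft < block + len(h) - 1:
            n_fft *= 2                       # power-of-two FFT covering block + tail
        H = np.fft.rfft(h, n_fft)
        y = np.zeros(len(x) + len(h) - 1)
        for start in range(0, len(x), block):
            seg = x[start:start + block]
            out = np.fft.irfft(np.fft.rfft(seg, n_fft) * H, n_fft)
            y[start:start + len(seg) + len(h) - 1] += out[:len(seg) + len(h) - 1]
        return y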

Of course sane engineering would call for the dimensionality of the problem to be reduced. Say, do a multidimensional KLT or at least a DCT before you process your signal set, and then drop the extra dimensions which no longer contribute significantly to the final result. As you know, the discarded components can be shown to fall off fast rather generally, in this setup, under most musically relevant steady state conditions.
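
A minimal sketch of the kind of reduction I mean, done as a plain block-wise KLT/PCA over the multichannel signal set (the block handling and the variance threshold are arbitrary placeholders, and a real implementation would of course track the basis over time):

    import numpy as np

    def klt_reduce(X, keep=0.99):
        """X: (channels, samples) block. Rotate into the eigenbasis of the
        inter-channel covariance and keep only the components carrying
        `keep` of the total variance. Returns reduced signals and basis."""
        C = X @ X.T / X.shape[1]              # inter-channel covariance
        w, V = np.linalg.eigh(C)              # eigenvalues, ascending
        idx = np.argsort(w)[::-1]
        w, V = w[idx], V[:, idx]
        cum = np.cumsum(w) / np.sum(w)
        k = int(np.searchsorted(cum, keep)) + 1
        return V[:, :k].T @ X, V[:, :k]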

Except they can *not* be proven to do so when the system function derives from geometrical-acoustical, time-variant causes, such as is the case in first-person-shooter games. Think about when you, as the shooter, come from a narrow passageway into a broad arena.

The game's designer would typically have to model those spaces as separate, and somehow interpolate between the solutions, because a global solution to the acoustic field equation is *very* much beyond the capability of *any* hardware out there. But then if you're doing third order ambisonics, even the transitory rendered field is rather hard to calculate. It's *especially* hard to calculate if you don't have a zero-delay, no-nonsense, constant load-and-latency library at your disposal, which does *not* require any extra calls to some weird principal component machinery in order to work. (And you know, over multidimensional signal sets, that machinery has *never* been put into a constant or even bounded effort framework, like Gardner did for zero delay convolution; your game now freezes up because the audio guy decided to optimize something in the long run, and you lose the tournament.)

Some people think they can build a perpetuum mobile. Or extract information out of nothing.

If you read my comments, I *never* do that. FFS, I'm a libertarian, very much of the Thatcher/Reagan sort: "There are no free lunches."

For each finite-order B-format, there are wildly different source distributions that will produce exactly the same B-format signals.

Indeed. But the vast majority of them are physically implausible. For instance, it is implausible that there would be more than tens of thousands of point sources, or more than ten extended sources, within any acoustical image captured by an ambisonic mic, or indeed synthetically rendered into any feed. Certainly they won't all be active at the same time and with precisely the same frequency distribution.

Which means that we can legitimately assume something about our signal set which otherwise needn't be true. We can make a priori assumptions, and then go from them to decoding solutions which otherwise would not be legitimate. We can do what in other contexts is called "superresolution". It's not "something from nothing", but rather "something for something".

And we've been doing this the whole time ambisonics has existed. Because the classical first order decoding solution which led us to the shelf filters is already a superresolution one: it assumes a single directional source, and two different metrics, from Makita theory. Higher up in frequency the power criterion, lower down the linear interference theory. But all of that in the context of a single source; it doesn't consider multiple, arbitrary, possibly somewhat coherent sources. Why? Because it can*not*; that immediately leads to an overcomplete problem, where there is no possibility of enhancement beyond just playing back the signal as it is, via a straight pseudoinverse.
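
For the record, those two metrics are usually written down as Gerzon's velocity and energy localisation vectors, and computing them for a candidate decode is a one-liner each; a rough numpy sketch for a horizontal rig (the gain and azimuth vectors are whatever your decoder produces):

    import numpy as np

    def velocity_energy_vectors(gains, azimuths):
        """Gerzon's rV and rE for real speaker gains at given azimuths (radians)."""
        u = np.stack([np.cos(azimuths), np.sin(azimuths)])   # speaker unit vectors, (2, N)
        g = np.asarray(gains)
        rV = (u @ g) / np.sum(g)          # velocity vector: the low-frequency criterion
        rE = (u @ g**2) / np.sum(g**2)    # energy vector: the high-frequency criterion
        return rV, rE

The classical shelf-filter decode then amounts to choosing per-band gains so that |rV| = 1 below the transition frequency and |rE| is maximised above it.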

Within that assumption, then, Harpex does a bit more: it takes into consideration what would happen if we had two specular sources, and inverts for that. DirAC takes a different approach: it's unprincipled in its handling of specular sources, but also rather nice in that it separates the time-quadrature, off-phase, resonant field from any straight propagating components. (Here the discussion Angelo Farina instigated years back is highly relevant: sound intensity theory strongly suggests that standing waves, i.e. the imaginary part of the Fourier transform of the SoundField signal set, have to do with reverberation, while the real part is about incident, propagating energy from a source. The later analysis by the HOA crowd seems to concur, and to be pretty much complete, after they also added spherical Hankel functions in order to deal with emanations outwards.)
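
To make that active/reactive split concrete, the DirAC-style per-bin parameter estimation from first-order B-format looks roughly like the following. A sketch only: the W scaling convention, the constants in the energy density, the XYZ polarity convention and the time averaging all vary between formulations:

    import numpy as np

    def dirac_parameters(W, X, Y, Z):
        """Per-bin direction of arrival and diffuseness from STFT bins of B-format."""
        v = np.stack([X, Y, Z])                                  # velocity-proportional part
        I = np.real(np.conj(W) * v)                              # active intensity (propagating)
        E = 0.5 * (np.abs(W)**2 + np.sum(np.abs(v)**2, axis=0))  # energy density, up to constants
        doa = -I / (np.linalg.norm(I, axis=0) + 1e-12)           # sign flips with XYZ polarity convention
        diffuseness = 1.0 - np.linalg.norm(I, axis=0) / (E + 1e-12)
        return doa, diffuseness

The real part of conj(W) times the velocity components is the propagating, "active" intensity; whatever is left over, relative to the total energy density, is roughly the reactive, reverberant part DirAC renders separately.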

It's quite easy to generate signals that will produce a completely wrong output from Harpex or similar systems.

That's a linear inverse problem, and if I know anything about mathematics, it's linear algebra. So yes, I could easily construct those very counter-examples.

Not under an assumption of a specular source, though. Not under the assumption of an arbitrary source, directionally averaged under a Makita metric, either.

So any parametric (assuming that is what you call 'active') decoder will have to start with some assumptions which may be wrong and will be wrong at least part of the time.

Precisely correct. Active matrix is the same as parametric matrix.

Also the distinction plane wave vs. diffuse is a bit too simple. Consider a choir (say 30 singers) distributed over a 90 degree horizontal arc, singing unisono. If you want to decode this as 30 point sources you'll need very high order. But it's not diffuse either. It's something in between, and a good parametric decoder should be able to detect such source distributions.

Yes, but if by unisono you actually mean "fully in phase", or what on the radio technology side would translate as "coherent", well, then taken at a point at the center of the arc, it's easily shown that the W component is just the sum of the contributions of the various sources. Any higher component is not, because of symmetry considerations. But they fall off quickly with order, even here.
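
If anyone wants to see how those higher components actually behave, summing the circular harmonics of 30 coherent unit sources over a 90 degree arc is a two-line experiment (purely a numerical illustration; I make no claim about the exact decay law):

    import numpy as np

    phi = np.linspace(-np.pi/4, np.pi/4, 30)     # 30 in-phase unit sources over a 90 degree arc
    for m in range(8):
        c = np.sum(np.exp(-1j * m * phi))        # m-th circular harmonic of the aggregate
        print(m, abs(c) / 30)                    # normalised so the W-like order 0 term is 1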

And of course you have to consider wavelength. Such an array singing "in unison" leads to different kinds of beaming depending on frequency. If it emits acoustical radiation in the 1 Hz regime, and it's about 1-10 m away, obviously constructive interference will dominate the near field of the observer. If it on the other hand emits ultrasonics in the 100 kHz range, *and* the emitters are widely spaced, 100 m away, the near field of the observer will be much more complicated.

Nota bene, we've talked about this dependency on rig diameter on-list before, under the topic of Martin Leese's "Big Geese": how, when you go onto the party grounds with your four big ambisonic speakers, suddenly the "geese sounds" just don't sound the same, even if angularly speaking it *ought to* be the same as in your living room. What gives?

What gives is the spatial, and as such directional, aliasing limit. If you had at your disposal a fully continuous-in-space rig, it wouldn't matter which bandwidths you're talking about. But since you're reconstructing a simulacrum of the field using discrete sources, it suddenly *does* matter: they interfere with each other by their actual, physical location, measured in the speed of sound. If you only have four, it *does* matter over an extended area around the so called "sweet spot" at the center, how far away those speakers really lie.
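
The usual rule of thumb is that the discrete rig stops impersonating a continuous one above the spatial aliasing frequency, roughly the speed of sound over twice the speaker spacing; a rough sketch (the exact limit depends on geometry and listening position, and the radii below are made-up examples):

    import numpy as np

    def aliasing_frequency(radius_m, n_speakers, c=343.0):
        """Rough spatial aliasing limit for a circular rig: f ~ c / (2 * spacing)."""
        spacing = 2 * np.pi * radius_m / n_speakers   # arc length between adjacent speakers
        return c / (2 * spacing)

    print(aliasing_frequency(1.5, 4))    # living-room sized rig: aliases from roughly 70 Hz up
    print(aliasing_frequency(10.0, 4))   # party-ground sized rig: aliases from roughly 10 Hz up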

(That theory is fully if rather opaquely expressed in Daniel, Nicol and Moreau's NFC-HOA papers; that's why they tell you there *has* to be a specific decoding radius as metadata; symmetry doesn't cut it as an argument. The extra niceness is that they deal with emanating waves as well, using Hankel functions. Was it Bruce or who, now, who also dealt with translated origins of these decompositions, in the so called "O-format", and its processing -- I seem to remember these reduce to Clebsch-Gordan decompositions, from quantum mechanics. So if you follow the math, the ambisonic theory ought to be pretty much complete; if not fully implemented, because of its final complexity.)

In practice such things become possible if you start with something like third or higher order. But the maths are by no means trivial.

I'd in fact call the math pretty much trivial. It's mostly just about LTI filtering, convolution. Its *optimized* form might not be quite as nice, but the basic framework is just same-ol'.

That is of course because Gerzon and the lot chose to set up the framework so that all kinds of quadrature and such apply. This goes to the age old adage in math: knowing *how* to frame a problem goes most of the way to solving it.

And if you have third or higher order, a correct linear decode will provide very good results and there is little reason to go parametric.

DirAC shows otherwise. Most people haven't heard what it can do, and yet thanks to Ville Pulkki, I have. It's amazing: even with a sparse reconstruction array, you suddenly "don't hear the speakers".

It's a different kind of active decoding, and its treatment of specular sources is far from principled (i.e. could be bettered to be sure, and as the ambisonic kind of guy that I am, I find its utilization of Ville Pulkki's VBAP to be highly suspect), but in its treatment of the reactive, reverberant field...OMFG. It sounds *natural*, even with very limited and anisotropic speaker resources. Truly! (I also believe they have been developing the idea further at Aalto University; haven't caught up with the newest, yet.)

I believe the principled way to go about this would be to treat the field as complex, and harmonic, then to square it in order to find point sources, express that instantaneous solution as a higher order complex spherical harmonical expansion, extract the out-of-phase component for DirAC-like processing, and to apply some time-running polynomial of the adjugate of the system function to set a variable time-frequency tradeoff.

Does anyone have a clue what this is supposed to mean?

Prolly just me, so let me explain a bit more. My basic ideas run as follows.

First, every dynamic decoder matrix out there runs on some variation of the same idea: you square the signal in order to find its power, then you apply some filter over time in order to find sustained peaks indicating a point/specular source, and after that you sharpen the image by applying a matrix operation which attenuates anything perpendicular to the detected signal. (Or in the case of Harpex, two signals at the same time. From periphonic first order B-format up to three could be detected without too much confusion, but nobody has done so yet.)

This reasoning is often obscured by how various active matrices are in practice implemented. For example, Dolby Surround/Pro Logic/MP/Stereo does its steering thingy originally by "feeding opposite polarity signals via an active matrix, in order to reduce crosstalk between the L/R and C/B pairs". Yet in order to do so without affecting overall sound intensity or without introducing too much skewing in the overall soundfield, it necessarily reduces to a nonlinear controller, feeding an orthogonal/unitary linear matrix.

(No other theory of quadraphonic active steering goes into the idea of unitarity, except the ambisonic one; Gerzon mentions it, since his theory necessarily requires it, but it's a one-off which isn't developed further. That's pretty bad, because Dolby's implementation could be made better if it were more readily known that, as in UHJ decoding, instead of just real-summing channels in order to get a 180 degree quadrature in the back surround channel, you could get a much nicer overall solution, with phasing pushed into the back, and continuous over the whole encoding envelope.)

So why not do this in a principled fashion? Since you want to do a second order operation (squaring, to get the power), start with an ambisonic/spherical harmonical decomposition, and square it. You can do that in closed form because you're talking about bandlimited discrete functions in time, and harmonical polynomials in space. What you'll end up with is two times oversampling in time, and twice the order of your original spherical harmonical decomposition. Quite the price, but it will be *exact*. No approximation, anywhere, and it will be capable of representing *any* composition of sources, fully isotropically, and without aliasing in time, just as the original signal set did.
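
The time-domain half of that is just the elementary fact that squaring a bandlimited signal doubles its bandwidth, so you oversample by two before squaring to keep the power signal alias-free; a quick numerical check, with made-up parameters:

    import numpy as np

    fs, n = 48000, 4096
    t = np.arange(n) / fs
    x = np.cos(2*np.pi*15000*t) + np.cos(2*np.pi*20000*t)    # bandlimited to 20 kHz

    # Squaring produces components up to 40 kHz, beyond the 24 kHz Nyquist limit
    # of the original rate, so upsample exactly 2x in the frequency domain first.
    x_up = np.fft.irfft(np.fft.rfft(x), 2 * n) * 2           # bandlimited 2x upsampling
    p = x_up ** 2                                            # instantaneous power at 2*fs, alias-free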

From there you can impose a principled focusing operation: if the power signal is focused, after you filter the norm via whatever auditory sensitivity function you choose, you can steer the signal by applying a matrix operation to it which keeps the averaged norm towards the direction of the detected signal the same, and attenuates/redistributes the power in other directions accordingly. In effect keeping all of the power constant, but sharpening in a certain direction. This is what all active decoders do, only in my framework it can be done exactly against any order of ambisonics, modulo only numerical precision. It of course in general goes to adjugates, trace theory and such, but it's just linear algebra; an easy piece compared to the psychoacoustics which would be necessary in order to construct the optimal open loop controller for something like this.
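
Just to give the shape of that operation, and emphatically not anybody's actual decoder: a rank-one "focus" matrix which boosts the subspace spanned by the encoding vector of the detected direction, passes the rest through, and then rescales so that a diffuse (isotropic) input keeps its total power. All names and the normalisation here are my own placeholders:

    import numpy as np

    def sharpen_matrix(y_dir, focus, n_channels):
        """y_dir: real SH encoding vector of the detected direction;
        focus > 1 sharpens towards it, focus = 1 leaves the field alone."""
        y = y_dir / np.linalg.norm(y_dir)
        P = np.outer(y, y)                          # projector onto the detected direction
        M = focus * P + (np.eye(n_channels) - P)    # boost along it, pass-through across
        # Rescale so an isotropic input keeps its average power (trace argument).
        return M / np.sqrt((focus**2 + n_channels - 1) / n_channels)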

Actually you *do* need Z. That's the point where I alluded to Christoph Faller above: if you cut out the third dimension, your reconstructed field will show an extra 1/r attenuation term from the rig inwards, because you're bleeding off energy into the third dimension.

You need the third dimension for realism. Not for correct decoding.

In fact you do, at least for extended area reconstruction.

Think about it. Suppose you have a far source which we can think of as exciting a plane wave at the origin. If you detect it and/or reconstruct it using point sources over a circle, you'll miss the fact that it has a component in the third direction. Even if it's just that simplest plane wave, the sensing and the reconstruction will 1) miss that 1/r attenuation Faller pointed out to me at the time, 1a) which has had to be compensated for in the planar WFS literature already, 2) there's hell to pay in sparse and not-too-regular/quadrature/point-like arrays because of how Huygens's principle works with them, yielding secondary point radiators which interfere (unlike the continuous arrays of the theory, especially in full 3D), and 3) the most interesting thingy: it's *not* theoretically sufficient even to have a full 3D rig in order to solve the propagating wave equation.

What you actually need in order to solve the system is to have, as part of your continuous rig, pointwise control of not just the pressure field, but of its normal derivative as well. That for a purely inwardly propagating field from afar. If you also consider outwards propagation, you actually need a "speaker" capable of doing everything a SoundField mic does at the center of the array, at every point in 3D space over some closed surface around the center point. Nothing else will do, if you want your solution to the wave equation to converge over the whole area, even given outwardly radiative, near-field solutions to the wave/acoustic equation.
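
The underlying statement is the Kirchhoff-Helmholtz integral: the interior field is fixed by both the pressure and its normal derivative on the enclosing surface. In LaTeX, with G the free-space Green's function (signs depend on the chosen normal direction and time convention, so treat this as a reminder rather than gospel):

    p(\mathbf{r}) = \oint_S \left[ G(\mathbf{r}|\mathbf{r}_s)\,
        \frac{\partial p(\mathbf{r}_s)}{\partial n}
        - p(\mathbf{r}_s)\,\frac{\partial G(\mathbf{r}|\mathbf{r}_s)}{\partial n}
        \right] \mathrm{d}S,
    \qquad
    G(\mathbf{r}|\mathbf{r}_s) = \frac{e^{-jk|\mathbf{r}-\mathbf{r}_s|}}{4\pi\,|\mathbf{r}-\mathbf{r}_s|}.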

Which then means that, even in the reduced analysis, you must explicitly say *why* you only analyse the sound pressure field and not the vectorial velocity field too. You might not *want* to, but it's there, and it's hugely relevant. Especially in regard to resonances, standing waves, convergence of solutions of the acoustic equation up to the very edge of any given bounded, convex set, outwardly bound energy (even given the Sommerfeld radiation condition), and the lot.
--
Sampo Syreeni, aka decoy - de...@iki.fi, http://decoy.iki.fi/front
+358-40-3751464, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2