On 2022-02-28, Fons Adriaensen wrote:
> General purpose libraries doing efficient zero delay convolution
> using multiple partition sizes (as suggested by Gardner) do exist.
> But they are complete overkill for the simple problem at hand
> (decoding UHJ).
In a sense they are overkill, but in another they are not. Because *if*
you can put an easy API on a zero-delay convolution library, which also
admits full parallelization and, as the Gardner framework does,
exhibits constant and asymptotically optimal computational load, the
thing is a no-worries plugin everybody can and will use. If you *don't*
have guarantees like that, developers won't adopt it: for fear of
complexity, of varying load which freezes your customer's phone up, of
extra latency which just doesn't work too well when the library is
embedded in a wider feedback loop (binaural following in gaming comes
to mind), or of just sheer average load.
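To make the Gardner point concrete, here is a minimal Python sketch of
the partition schedule such a library runs on. The function name, base
size and doubling policy are my own illustration, not any particular
library's API: the first partitions stay at the base block size, so
latency never exceeds one block, while the sizes double so the number
of distinct FFT sizes, and hence the per-sample cost, grows only
logarithmically in the impulse response length.

```python
def gardner_partitions(ir_length, base=64, repeats=2):
    """Cover an impulse response of ir_length samples with partitions
    whose sizes double every `repeats` blocks (a Gardner-style
    schedule). The first partition runs at the base block size, so the
    scheme adds no latency beyond the base block."""
    sizes = []
    covered = 0
    size = base
    while covered < ir_length:
        for _ in range(repeats):
            if covered >= ir_length:
                break
            sizes.append(size)
            covered += size
        size *= 2
    return sizes

# A 16384-tap IR splits into a short doubling schedule; the early,
# small partitions give the zero latency, the late, big ones the
# efficiency.
print(gardner_partitions(16384))
```

The schedule is what buys the constant, bounded load: each doubling
tier contributes a fixed amount of work per output sample.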
> Even the most basic FFT-based implementation wouldn't take more than
> one percent CPU time on a ten year old PC or something like a
> Raspberry Pi.
Indeed, and if you run such algorithms on a DSP, with say modulo
addressing and a proper streaming memory access pipeline, the hit can be
even less...
...unless you're trying something new, like fully general time-variant
HOA reverberation. The computational load, especially if you have to
route it through a plug-in zero delay library, quickly multiplies. For
instance, if you do full periphonic third order reverberation, you have
n=(m+1)^2 signals to contend with, which already at m=3 leaves you with
16 independent input and output signals, and as such 16^2=256
cross-connections, convolutions, between the components.
This is no longer a trivial load for any current processor, or even a
processor array. Not least because of the memory bandwidth and cache
contention concerns.
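The blow-up above is just this bit of arithmetic; the helper names are
mine, for illustration:

```python
def ambisonic_channels(order, periphonic=True):
    """Number of independent signals in a full-sphere (periphonic)
    ambisonic set of order m: n = (m + 1)^2."""
    if periphonic:
        return (order + 1) ** 2
    # horizontal-only sets grow only linearly: n = 2m + 1
    return 2 * order + 1

def cross_convolutions(order):
    """Convolutions needed for a fully general reverberation matrix
    coupling every input component to every output component."""
    n = ambisonic_channels(order)
    return n * n

print(ambisonic_channels(3), cross_convolutions(3))  # prints: 16 256
```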
Of course sane engineering would call for the dimensionality of the
problem to be reduced. Say, do a multidimensional KLT or at least a DCT
before you process your signal set, and then cut off the extra
dimensions which cease to contribute significantly to the final
processing. As you know, the discarded coefficients can be proven to
fall off fast rather generally in this setup, under most musically
relevant steady-state conditions.
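A toy demonstration of that energy-compaction claim, with a naive
stdlib-only DCT-II; the smooth bump standing in for a "musically
relevant" signal profile is an arbitrary choice of mine:

```python
import math

def dct2(x):
    """Naive O(n^2) DCT-II, enough to illustrate energy compaction."""
    n = len(x)
    return [sum(x[k] * math.cos(math.pi * (k + 0.5) * j / n)
                for k in range(n))
            for j in range(n)]

# A smooth profile: its energy should pile into the first few bins,
# so the trailing dimensions can be cut off with little loss.
n = 64
x = [math.exp(-((k - 20) / 12.0) ** 2) for k in range(n)]
c = dct2(x)
energy = sum(v * v for v in c)
leading = sum(v * v for v in c[:8])
print(round(leading / energy, 4))  # well above 0.95
```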
Except they can *not* be proven to do so when the system function
derives from geometrical-acoustical, time-variant causes, such as is
the case in first-person-shooter games. Think about when you, as the
shooter, come from a narrow passageway into a broad arena.
The game's designer would typically have to model those spaces as
separate, and somehow interpolate between the solutions, because a
global solution to the acoustic field equation is *very* much beyond the
capability of *any* hardware out there. But then if you're doing third
order ambisonics, even the transitory rendered field is rather hard to
calculate. It's *especially* hard to calculate if you don't have a
zero-delay, no-nonsense, constant-load-and-latency library at your
disposal, one which does *not* require extra calls into some weird
principal component machinery in order to work. (And you know, over
multidimensional signal sets, that machinery has *never* been put into
a constant or even bounded-effort framework the way Gardner did for
zero delay convolution; your game now freezes up because the audio guy
decided to optimize something in the long run, and you lose the
tournament.)
> Some people think they can build a perpetuum mobile. Or extract
> information out of nothing.
If you read my comments, I *never* do that. FFS, I'm a libertarian, much
of the Thatcher/Reagan sort: "There are no free lunches."
> For each finite-order B-format, there are wildly different source
> distributions that will produce exactly the same B-format signals.
Indeed. But the vast majority of them are physically implausible. For
instance, it is implausible that there would be more than tens of
thousands of point sources, or more than ten extended sources, within
any acoustical image captured by an ambisonic mic, or indeed
synthetically rendered into any feed. Certainly they won't all be
active at the same time and with precisely the same frequency
distribution.
Which means that we can legitimately assume something about our signal
set which otherwise needn't be true. We can make a priori assumptions,
and then go from them to decoding solutions which otherwise would not be
legitimate. We can do what in other contexts is called
"superresolution". It's not "something from nothing", but rather
"something for something".
And we've been doing this the whole time ambisonics has existed,
because the classical first order decoding solution, which led us to
the shelf filters, is already a superresolution one: it assumes a
single directional source, and two different metrics, from Makita
theory. Higher up, the power criterion; lower down, the linear
interference theory. But all of that is in the context of a single
source; it doesn't consider multiple, arbitrary, possibly somewhat
coherent sources. Why? Because it can*not*: that immediately leads to
an overcomplete problem, where there is no possibility of enhancement
beyond just playing back the signal as it is, via a straight
pseudoinverse.
Within that assumption, then, Harpex does a bit more: it takes into
consideration what would happen if we had two specular sources, and
inverts for that. DirAC takes a different approach: it's unprincipled in
its handling of specular sources, but also rather nice in that it
separates the time-quadrature, off-phase, resonant field from any
straight propagating components. (Here the discussion Angelo Farina
instigated years back is highly relevant: sound intensity theory
strongly suggests that standing waves, i.e. the imaginary part of the
Fourier transform of the SoundField signal set, have to do with
reverberation, while the real part is about incident, propagating
energy from a source. The later analysis by the HOA crowd seems to
concur, and to be pretty much complete, after they also added spherical
Hankel functions in order to deal with emanations outwards.)
> It's quite easy to generate signals that will produce a completely
> wrong output from Harpex or similar systems.
That's a linear inverse problem, and if I know anything about
mathematics, it's linear algebra. So yes, I could easily construct those
very counter-examples.
Not under the assumption of a specular source, though. Nor under the
assumption of an arbitrary source, directionally averaged under a
Makita metric.
> So any parametric (assuming that is what you call 'active') decoder
> will have to start with some assumptions which may be wrong and will
> be wrong at least part of the time.
Precisely correct. Active matrix is the same as parametric matrix.
> Also the distinction plane wave vs. diffuse is a bit too simple.
> Consider a choir (say 30 singers) distributed over a 90 degree
> horizontal arc, singing unisono. If you want to decode this as 30
> point sources you'll need very high order. But it's not diffuse
> either. It's something in between, and a good parametric decoder
> should be able to detect such source distributions.
Yes, but if by unisono you actually mean "fully in phase", or what on
the radio technology side would translate as "coherent", well, then
taken at a point at the center of the arc, it's easily proven that the
W component is just the sum of the contributions of the various
sources. Any higher component is not, because of symmetry
considerations. But they fall off exponentially by order, even here.
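That choir case can be checked numerically. A sketch, using the plain
unweighted first-order encoding convention (no 1/sqrt(2) on W; my
simplification): W comes out as the coherent sum of all thirty
sources, the antisymmetric Y component cancels exactly at the arc's
center, and X already falls below W.

```python
import math

# 30 coherent unit sources spread over a 90-degree arc centred on the
# x axis; encode them into first-order components at the arc's centre.
n = 30
angles = [math.radians(-45 + 90 * k / (n - 1)) for k in range(n)]
W = sum(1.0 for _ in angles)          # pressure: coherent sum of all
X = sum(math.cos(a) for a in angles)  # direction-weighted sums pick
Y = sum(math.sin(a) for a in angles)  # up less than the full 30

print(round(W, 3), round(X, 3), round(Y, 9))
```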
And of course you have to consider wavelength. Such an array singing
"in unison" leads to different kinds of beaming depending on frequency.
If it emits acoustical radiation in the 1Hz regime, and it's about
1-10m away, obviously constructive interference will dominate the near
field of the observer. If, on the other hand, it emits ultrasonics in
the 100kHz range, *and* the emitters are widely spaced, 100m away, the
nearfield of the observer will be much more complicated.
Nota bene, we've talked about this dependency on rig diameter on-list
before. Under the topic of Martin Leese's "Big Geese". How when you go
into the party ground with your four big ambisonic speakers, and
suddenly the "geese sounds" just don't sound the same, even if angularly
speaking it *ought to* be the same as in your living room. What gives?
What gives is the spatial, and as such directional, aliasing limit. If
you had at your disposal a fully continuous-in-space rig, it wouldn't
matter which bandwidths you're talking about. But since you're
reconstructing a simulacrum of the field using discrete sources, it
suddenly *does* matter: they interfere with each other by their actual,
physical location, measured in the speed of sound. If you only have
four, it *does* matter, over an extended area around the so called
"sweet spot" at the center, how far away those speakers really lie.
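Back-of-envelope on the "Big Geese" effect; the c/(2*spacing) rule of
thumb and the rig sizes below are my own assumptions, just to show the
trend that the bigger rig aliases at a much lower frequency:

```python
import math

def alias_frequency(rig_radius_m, n_speakers, c=343.0):
    """Rough spatial-aliasing limit for a horizontal ring of speakers:
    above about c / (2 * spacing) the discrete sources no longer fake
    a continuous wavefront, and interference artefacts take over."""
    spacing = 2 * math.pi * rig_radius_m / n_speakers
    return c / (2 * spacing)

# Same four-speaker layout, living room vs. party ground:
print(round(alias_frequency(1.5, 4)))   # ~1.5 m living-room rig
print(round(alias_frequency(10.0, 4)))  # big festival rig
```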
(That theory is fully if rather opaquely expressed in Daniel, Nicol and
Moreau's NFC-HOA papers; that's why they tell you there *has* to be a
specific decoding radius as metadata; symmetry doesn't cut it as an
argument. The extra niceness is that they deal with emanating waves as
well, using Hankel functions. Was it Bruce or who, now, who also dealt
with translated origins of these decompositions, in the so called
"O-format", and its processing -- I seem to remember these reduce to
Clebsch-Gordan decompositions, from quantum mechanics. So that if you
follow the math, the ambisonic theory ought to be pretty much complete,
if not fully implemented, because of its final complexity.)
> In practice such things become possible if you start with something
> like third or higher order. But the maths are by no means trivial.
I'd in fact call the math pretty much trivial. It's mostly just about
LTI filtering, convolution. Its *optimized* form might not be quite as
nice, but the basic framework is just same-ol'.
That's of course because Gerzon and the lot chose to do the framework
so that all kinds of quadrature and such apply. This goes to the age
old adage in math: knowing *how* to frame a problem goes most of the
way to solving it.
> And if you have third or higher order, a correct linear decode will
> provide very good results and there is little reason to go
> parametric.
DirAC shows otherwise. Most people haven't heard what it can do, and yet
thanks to Ville Pulkki, I have. It's amazing: even with a sparse
reconstruction array, you suddenly "don't hear the speakers".
It's a different kind of active decoding, and its treatment of specular
sources is far from principled (i.e. could be bettered to be sure, and
as the ambisonic kind of guy that I am, I find its utilization of Ville
Pulkki's VBAP to be highly suspect), but in its treatment of the
reactive, reverberant field...OMFG. It sounds *natural*, even with very
limited and anisotropic speaker resources. Truly! (I also believe they
have been developing the idea further at Aalto University; haven't
caught up with the newest, yet.)
I believe the principled way to go about this would be to treat the
field as complex, and harmonic, then to square it in order to find
point sources, express that instantaneous solution as a higher order
complex spherical harmonical expansion, extract the out-of-phase
component for DirAC-like processing, and to apply some time-running
polynomial of the adjugate of the system function to set a variable
time-frequency tradeoff.
> Does anyone have a clue what this is supposed to mean?
Prolly just me, so let me explain a bit more. My basic ideas run as
follows.
First, every dynamic decoder matrix out there runs on some variation of
the same idea: you square the signal in order to find its power, then
you apply some filter over time in order to find sustained peaks
indicating a point/specular source, and after that you sharpen the image
by applying a matrix operation which attenuates anything perpendicular
to the detected signal. (Or, in the case of Harpex, two signals at the
same time. From periphonic first order B-format up to three could be
detected without too much confusion, but nobody has done so yet.)
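The square-filter-steer loop above can be sketched as follows.
`steer_estimate`, its smoothing constant, and the dictionary of
accumulators are all hypothetical names of mine, covering the generic
front end, not any particular product: square for power, low-pass the
short-time intensity estimates, read off a dominant direction and a
focus measure.

```python
import math

def steer_estimate(W, X, Y, smooth=0.9, state=None):
    """One step of a generic active-decoder front end: square the
    signals for power, low-pass the short-time estimates, and read off
    a dominant azimuth plus a focus (versus diffuseness) measure.
    `state` carries the smoothed accumulators between calls."""
    if state is None:
        state = {"p": 0.0, "ix": 0.0, "iy": 0.0}
    state["p"] = smooth * state["p"] + (1 - smooth) * (W * W)
    state["ix"] = smooth * state["ix"] + (1 - smooth) * (W * X)
    state["iy"] = smooth * state["iy"] + (1 - smooth) * (W * Y)
    azimuth = math.atan2(state["iy"], state["ix"])
    focus = math.hypot(state["ix"], state["iy"]) / (state["p"] + 1e-12)
    return azimuth, focus, state

# A steady source encoded at 30 degrees is detected after a few frames.
az = math.radians(30)
state = None
for _ in range(200):
    s = 1.0  # constant test signal
    azimuth, focus, state = steer_estimate(
        s, s * math.cos(az), s * math.sin(az), state=state)
print(round(math.degrees(azimuth), 2))  # prints: 30.0
```

A real decoder would of course run this per frequency band and feed
the estimate into the sharpening matrix, not print it.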
This reasoning is often obscured by how various active matrices are in
practice implemented. For example, Dolby Surround/Pro Logic/MP/Stereo
does its steering thingy originally by "feeding opposite polarity
signals via an active matrix, in order to reduce crosstalk between the
L/R and C/B pairs". Yet in order to do so without affecting overall
sound intensity or introducing too much skewing in the overall
soundfield, it necessarily reduces to a nonlinear controller feeding an
orthogonal/unitary linear matrix.
(No other theory of quadraphonic active steering goes into the idea of
unitarity, except the ambisonic one; Gerzon mentions it, since his
theory necessarily requires it, but it's a one-off which isn't developed
further. That's pretty bad, because Dolby's implementation could be
made better if it were more widely known that, as in UHJ decoding,
instead of just real-summing channels in order to get a 180 degree
quadrature in the back surround channel, you could get a much nicer
overall solution, with phasing pushed into the back, and continuous
over the whole encoding envelope.)
So why not do this in a principled fashion? Since you want to do a
second order operation (squaring, to get the power), start with an
ambisonic/spherical harmonical decomposition, and square it. You can do
that in closed form because you're talking about bandlimited discrete
functions in time, and harmonical polynomials in space. What you'll end
up with is two times oversampling in time, and twice the order of your
original spherical harmonical decomposition. Quite the price, but it
will be *exact*. No approximation, anywhere, and it will be capable of
representing *any* composition of sources, fully isotropically, and
without aliasing in time, just as the original signal set did.
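The doubling claim is ordinary trigonometry: squaring a tone at one
frequency produces DC plus a tone at twice the frequency, which is why
the exact representation needs twice the sample rate in time and twice
the order in space. A stdlib check with a naive DFT (the bin numbers
are my example):

```python
import cmath, math

def dft_mag(x):
    """Naive O(n^2) DFT magnitude spectrum, normalized by length."""
    n = len(x)
    return [abs(sum(x[k] * cmath.exp(-2j * math.pi * j * k / n)
                    for k in range(n))) / n for j in range(n)]

n = 64
f = 5
tone = [math.cos(2 * math.pi * f * k / n) for k in range(n)]
power = [v * v for v in tone]  # the squaring step of the decoder

# cos^2 = 1/2 + (1/2)cos(2x): energy lands at DC and at twice the
# original frequency (bin 10, and its mirror bin 54).
peaks = [j for j, m in enumerate(dft_mag(power)) if m > 0.1]
print(peaks)  # prints: [0, 10, 54]
```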
From there you can impose a principled focusing operation: if the power
signal is focused, after you filter the norm via whatever auditory
sensitivity function you choose, you can steer the signal by applying a
matrix operation to it which keeps the averaged norm toward the
direction of the detected signal the same, while attenuating and
redistributing the power in other directions. In effect keeping all of
the power constant, but sharpening in a certain direction. This is what
all active decoders do, only in my framework it can be done exactly,
against any order of ambisonics, modulo only numerical precision. It of
course in general goes to adjugates, trace theory and such, but it's
just linear algebra; an easy piece compared to the psychoacoustics
which would be necessary in order to construct the optimal open loop
controller for something like this.
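A toy version of that power-conserving sharpening, over a discretely
sampled set of directions. The cosine-shaped boost is an arbitrary
choice of mine; the point is only the exact renormalization, which
keeps the total power constant while pushing it toward the detected
direction.

```python
import math

def sharpen(power, target_idx, amount):
    """Redistribute a sampled directional power distribution toward
    the detected direction while keeping the total power constant:
    boost near the target, then rescale everything to the old total."""
    total = sum(power)
    n = len(power)
    boosted = [p * (1 + amount *
                    math.cos(2 * math.pi * (i - target_idx) / n))
               for i, p in enumerate(power)]
    scale = total / sum(boosted)
    return [b * scale for b in boosted]

# A diffuse distribution over 8 directions, sharpened toward index 2:
power = [1.0] * 8
out = sharpen(power, 2, 0.5)
print(round(sum(out), 6), round(out[2], 3))  # prints: 8.0 1.5
```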
>> Actually you *do* need Z. That's the point where I alluded to
>> Christoph Faller above: if you cut out the third dimension, your
>> reconstructed field will show a 1/r extra attenuation term from the
>> rig inwards, because you're bleeding off energy to the third
>> dimension.
> You need the third dimension for realism. Not for correct decoding.
In fact you do. At least for extended area reconstruction.
Think about it. Suppose you have a far source which we can think of as
exciting a plane wave at the origin. If you detect it and/or
reconstruct it using point sources over a circle, you'll miss the fact
that it has a component in the third direction. Even if it's just that
simplest plane wave, the sensing and the reconstruction will 1) miss
that 1/r attenuation Faller pointed out to me at the time, 1a) which
has had to be compensated for in the planar WFS literature already, 2)
there's hell to pay in sparse and not-too-regular/quadrature/point-like
arrays because of how Huygens's principle works with them, yielding
secondary point radiators which interfere (unlike the continuous arrays
of the theory, especially in full 3D), and 3) the most interesting
thingy: it's *not* theoretically sufficient even to have a full 3D rig
in order to solve the propagating wave equation.
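The 1/r point in 1) can be put in numbers. Reading the extra
attenuation as an intensity (energy) ratio, which is my assumption
here: spherical spreading loses intensity as 1/r^2, while cylindrical
spreading, energy confined to the horizontal plane, loses it only as
1/r, so a real 3D field trails a 2D reconstruction by exactly the
extra 1/r.

```python
def spherical_intensity(r):
    """Point source radiating into full 3D space: intensity ~ 1/r^2."""
    return 1.0 / (r * r)

def cylindrical_intensity(r):
    """Same energy confined to a plane (horizontal-only rig):
    intensity ~ 1/r."""
    return 1.0 / r

# Compare the rolloff between 1 m and 4 m from the rig inwards:
r1, r2 = 1.0, 4.0
extra = ((spherical_intensity(r2) / spherical_intensity(r1)) /
         (cylindrical_intensity(r2) / cylindrical_intensity(r1)))
print(extra)  # prints: 0.25, i.e. the extra 1/r with r = 4
```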
What you actually need in order to solve the system is, as part of
your continuous rig, pointwise control of not just the pressure field
but its normal derivative. That for a purely inwardly propagating
field from afar. If you also consider outwards propagation, you actually
need a "speaker" capable of doing everything a SoundField mic does at
the center of the array, at every point in 3D space around some closed
surface around the center point. Nothing else will do, if you want your
solution to the wave equation to converge over the whole area, even
given outwardly radiative, near-field solutions to the wave/acoustic
equation.
Which then means that, even in the reduced analysis, you must
explicitly say *why* you only analyse the sound pressure field, and not
the vectorial velocity field too. You might not *want* to, but it's there,
and it's hugely relevant. Especially in regards to resonances, standing
waves, convergence of solutions of the acoustical equation to the very
edge of any given bounded, convex set, outwardly bound energy (even
given the Sommerfeld radiation condition), and the lot.
--
Sampo Syreeni, aka decoy - de...@iki.fi, http://decoy.iki.fi/front
+358-40-3751464, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2