Dear colleagues,
MPEG has issued a CfP (Call for Proposals) for MPEG-H part 3 (3D
Audio), also known as ISO/IEC 23008-3. (ISO/IEC 23008-2 is HEVC.)
Several documents referring to the (still early) standardization process
can be found on Leonardo Chiariglione's MPEG page:
http://mpeg.chiariglione.org/
The press release from MPEG 103 (the last meeting) can be found near
the top of the page. At its end you will find three documents related
to MPEG-H 3D Audio.
In this posting I try to give some information about the standardization
process for 3D audio, within its wider context. Secondly, I would like
to offer a few public comments about the present state of 3D Audio in
the cinema, and some challenges/problems MPEG/ISO will have to face.
(And maybe even solve. :-) )
Thirdly, nobody has to read every document I will cite. (Some of the
cited material is pretty technical.)
Reading the MPEG documents (especially the "Call for Proposals"
document), it is obvious that they took the NHK/Hamasaki 22.2 system as
the reference for a high-end home theater audio system (or cinema audio).
A technical description of the "22.2 Multichannel Sound System for
Ultra High-Definition TV", presented by its direct proponents:
http://www.nhk.or.jp/digital/en/technical/pdf/IBC2007_08040907.pdf
The system is backward-compatible to 5.1, see (for example) fig. 4.
If I may offer some criticism: the results in fig. 6 look slightly
disappointing when comparing the subjective evaluations of 22.2 and 5.1.
Mixing of music must be discussed from both
acoustic and artistic viewpoints. Reproduction of the
ambience of a concert hall in musical shows can be
easily discussed from both viewpoints. However,
mixing of pop music cannot be easily discussed
from the viewpoint of acoustics or physics, and the
sound design of TV programs using
three-dimensional sound systems very much
depends on personal artistic taste and the content of
the program.
My interpretation would be that the recording and mixing of real
concert hall acoustics was/is not < that > easy, and very probably they
didn't make full use of the possibilities of the 22.2 system. (The
"3D" aspect alone should probably provide a clearer difference from
"conventional" 5.1 surround than they measured.)
We have to be fair: The listening tests were based on early works on the
22.2 system.
We also thank the
staff of the NHK production operations center for their contribution
in producing the demonstration
content for the 22.2 multichannel sound system at World EXPO 2005 in
Aichi, Japan.
Now, it seems to be the case that MPEG already has a certain plan for
how 3D Audio "should" be coded. The technology "should re-use existing
MPEG technology wherever possible" (CfP, p. 5, at the very top).
Without giving detailed reasons, I will just state that this technology
is Spatial Audio Coding, according to MPEG-D or ISO/IEC 23003.
http://mpeg.chiariglione.org/standards/mpeg-d
(Obviously AAC, HE-AAC, SBR, parametric coding etc. are included in
this framework.)
There already exists a defined MPS codec, implemented in decoders.
Two (unfortunately older) articles about this stuff:
a)
http://infoscience.epfl.ch/record/54892/files/SPACE_AES119_v9.pdf
(MPEG surround, MPS)
b)
http://www.jeroenbreebaart.com/papers/aes/aes124.pdf
(SAOC)
Since an MPEG Surround (MPS) decoder serves as final
rendering engine, the task of transcoding consists in
combining SAOC parameters and rendering information
associated with each audio object to a standards compliant
MPS bitstream.
3.3.3. Binaural Transcoding
Headphones as reproduction device have gained significant
interest during the last decade.
Accordingly, the intended (?) MPEG approach might be based on a modified
MPS-SAOC approach, introducing the missing "3D" elements into MPS. (SAOC
already seems to be 3D-aware, if I am not misled by some sources. A
combination of a channel- and object-based approach is required by the
CfP; see p. 19, Test Material.)
The role of HOA seems to be restricted to that of an encoder input format.
Now, it seems to me that there are some clear omissions and problems in
the MPEG approach.
1. Why don't/won't you allow direct, channel-based encoding of
NHK/Hamasaki 22.2 and Auro-3D 11.1 via AAC?
If we take 64 kbit/s per channel as the base rate, 22.2 can be coded in
1.536 Mbit/s and 11.1 in 768 kbit/s. Both are known as typical DTS
bitrates, so hardly "excessive".
(Note that the two LFE channels can be coded very efficiently, because
they contain only frequencies from 3 Hz to 120 Hz. So you can take a
range from 0 Hz to 200 Hz, or whatever other upper frequency limit you
prefer, and code just this. LFE channels can be coded in very few bits,
because the LFE band spans less than 1% of the audible frequency range.)
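To illustrate the arithmetic above, here is a minimal sketch in Python. The 64 kbit/s per-channel figure comes from this posting; the 8 kbit/s LFE rate is my own hypothetical placeholder for "very few bits":

```python
# Rough total bitrates for direct channel-based AAC coding.
# per_channel_kbps = 64 follows the text; lfe_kbps = 8 is a
# hypothetical placeholder rate for a band-limited LFE channel.

def total_bitrate_kbps(full_channels, lfe_channels=0,
                       per_channel_kbps=64, lfe_kbps=8):
    """Total bitrate in kbit/s for a channel-based layout."""
    return full_channels * per_channel_kbps + lfe_channels * lfe_kbps

print(total_bitrate_kbps(24))     # 22.2 as 24 full channels -> 1536 kbit/s
print(total_bitrate_kbps(12))     # 11.1 as 12 full channels -> 768 kbit/s
print(total_bitrate_kbps(22, 2))  # 22.2 with cheap LFEs     -> 1424 kbit/s
```

The last line just shows that coding the two LFEs cheaply only lowers the total further.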
2. If the aim is to code 3D audio, I see a very fundamental problem in
any "3D MPS-SAOC" plus 22.2-"reference" channel-based approach: the
combination is clearly a half-sphere solution, not full-sphere!
I know that it is hard to install loudspeakers below the listeners' or
spectators' feet in a cinema, theatre or opera building. But this doesn't
mean there are no reflections, reverb or even direct sources from
< down > directions, if you want to represent natural acoustical
environments.
If we talk about headphone-based representation of 3D audio, nothing
prevents decoding sounds which come from the lower hemisphere. I could
even give some good reasons why the lower hemisphere is < more >
important for 3D audio than the upper hemisphere.
In cathedral acoustics, floor reflections are IMO about as important
as ceiling reflections. 3D reverb will have < up > and < down >
components. In concert hall acoustics, early reflections tend to come
from the sides, and more from the floor than from the ceiling.
In open-air acoustics, floor reflections are important, but there is no
ceiling. So in "open" acoustics (think even of performance venues like
amphitheatres), the lower hemisphere is probably more important for
3D audio than the upper one. (OK: in the film sound case, Rambo might be
attacked by a helicopter, clearly an upper-hemisphere event. But think
of the ground reflections, still important... ;-) )
Three-dimensional, immersive sound is not only about
discrete, individual sounds that originate from around and
above the listener (such as birds, airplanes, etc.), but more
important is the ability of the sound system to deliver the
sensation of being immersed by sound that originates from the
combination of both Lower and Height channels – accurately
reproducing the ambient sounds such as concert hall
reflections, the sound of leaves rustling in the forest or the
ambience of a busy market square.
http://www.barco.com/en/Auro11-1/Auro_11-1_explained/~/media/48FD91C7F9574A4280E49AA8C4CCA90E.pdf
(van Baelen, etc.)
Because 11.1 (Auro-3D) is also an "upper-hemisphere system" but not
full-sphere, the "sound of leaves rustling in the forest" can't be
reproduced. Q.e.d.
If we talk about games, I believe we would have more direct sound from
below than from above. (Your footsteps. The cracking branches in the
woods, think of playing a very violent shooter game... but you just
have to defend your country, family... The engine and transmission
sounds in your virtual Formula 1 car...)
The requirements for 3D audio in cinema/HT and in music/games/VR
(virtual reality) might actually be quite different, at least in
practice. In the first case, it is hard to install speakers at
elevations of, say, "below -20º". In the cinema "progressive 3D film
sound" case, the 3D proponents will speak of the need for "height", but
natural soundfields and real 3D audio would also have to cover the
lower hemisphere. (Direct sources, reflections, and the reverb elements
which come from below.)
I see two solutions: include Ambisonics/HOA as a permitted codec in the
MPEG-H 3D Audio framework (even 1st-order Ambisonics is full-sphere 3D
audio), OR include some virtual low speakers in the "3D MPS-SAOC"
framework.
You could also deny the problem and say that ISO/IEC 23008-3 is for
cinema/film audio, and nothing else. However, my impression is that "we"
originally wanted a general solution for 3D audio. And in any case,
unsolved issues will be picked up by the competition.
3. I certainly believe that a parametric/object-based approach might be
best for very low (and maybe low) bitrates, but things might never get
perfect at any bitrate!
Exhibit A (parametric coding)
http://en.wikipedia.org/wiki/Parametric_stereo
Because only one audio channel is transmitted, along with the
parametric side info, a 24 kbit/s coded audio signal with Parametric
Stereo will be substantially improved in quality, compared to a
discretely stereo coded audio signal coded with conventional means.
Thus, the additional bitrate spent on the single mono channel
(combined with some PS side info) will improve the perceived quality
substantially of the audio compared to a standard stereo stream at
similar bitrate.
However, this technique is only useful at the lowest bitrates (say 16
- 32 kbit/s, sweet-spot at 24 kbit/s) to give a good stereo
impression, so while it can improve perceived quality at very low
bitrates, it generally does not achieve transparency
<http://en.wikipedia.org/wiki/Transparency_%28data_compression%29>,
since simulating the stereo dynamics of the audio with the technique
is limited and generally deteriorates perceived quality regardless of
the bitrate.
I admit that this is the most extreme case, because < any > directional
information would have to be described in parametric form.
Exhibit B
http://tech.ebu.ch/docs/tech/tech3324.pdf
Looking at the results in fig. 2, it is hard to claim that HE-AAC_MPS_64
(MPS: MPEG Surround) or even HE-AAC_MPS_96 are good enough to be used
for broadcast, because the reproduction of at least < some > samples has
been rated as poor.
Looking at fig. 4, HE-AAC "5.1" also had problems at any given bitrate.
The same holds for DD+ up to 256 kbps. It is interesting that L2_MPS_256
seems to be in the good area, but nowhere excellent. Because that
bitrate should be sufficient for very good results when coding stereo in
MPEG-1 Audio Layer II, the L2_MPS_256 example just proves that you can't
code "the rest" (= the surround elements) in a purely parametric fashion.
It can be concluded that, at the moment, the MPEG HE-AAC seems to be
the most favourable
choice for a broadcaster requiring a good scalability of bitrate
versus quality, down to relatively
low bit rates.
The new coding systems with parametric coding of the surround
information of the various audio
channels, such as MPEG Surround, show a rather unbalanced behaviour
which depends on the type
of the test sequence. For the MPEG Surround codecs "applause" is again
the most critical sequence
resulting in only "Fair", or sometimes even "Poor" audio quality. It
can be concluded that the MPEG
Surround codecs at the moment do not fulfil the requirements for high
quality broadcasting at their
target bit rate, or do not offer the expected advantage in terms of
bitrate gain compared to
already well-established codecs.
4. Audio objects
I have long argued that audio objects don't solve < every > problem,
because they won't sound really real. :-)
You can certainly route an audio object to some 3D position. But you
won't get the reflections and reverb of this virtual source. At least I
don't know of any decoder which would calculate these things for a
given acoustic... (They would have to be described via metadata, room or
"space" impulse responses, etc. Not very realistic to include this in
some 3D audio home receiver...)
The practical solution could be to split direct audio objects from their
reflections/reverb, the latter being transmitted in the < channel >
base. However, the direct and reflected/reverberated sound of an audio
object might easily become detached. Even worse, this actually has to
happen in some circumstances. (Audio objects are rendered as "direct
elements" to available individual speaker positions, whereas their
reflections/reverb would stay in the 5.1/22.2 channel base. Note that a
cinema certainly has more than 6 speakers, if we talk about any
real-world 5.1 installation.)
If working in an "HOA studio environment" (which doesn't really exist,
but anyway), you would code audio objects and their reflections/reverb
into the soundfield.
In any case, I found another opinion which seems to back my position.
(I didn't actually "copy" these views; I found this part by accident
while reading about Auro-3D...)
http://www.barco.com/en/Auro11-1/Auro_11-1_explained/~/media/48FD91C7F9574A4280E49AA8C4CCA90E.pdf
Based on the principles described above, it can be understood
that the Object-based approach is especially powerful for the
reproduction of discrete sound sources that require precise
localization as well as movement. Examples are flying bullets,
cars passing by, a knock on a door… However, this ... works
best if these sources are recorded ʻdryʼ, without reverberation
or reflections.
In reality, however, such distinct, dry sources are the
exception rather than the rule; they are almost always
accompanied by ʻwetʼ or ambient sounds coming from the
objectʼs surroundings.
That is also the reason
why some systems apply a hybrid approach where ambient
sounds as well as music are Channel-based, while some
discrete sounds are then handled as objects.
In these cases, the audio objects are often recorded dry, while
the necessary reflections and reverberation to make the sound
life-like are added as pre-mixed channels.
But within some form of "3D MPS-SAOC", the pre-mixed channels containing
the "wet" indirect elements of any audio object will be coded in some
parametric way, and will this work? It won't, because the dry and wet
parts of an audio object (staying in this terminology) will become
detached... A parametric description of reflections/reverb just won't
keep exact positions. At least the first reflections of audio objects
would need precise positions, I would assume.
5. I allow myself to < suggest > that Ambisonics/HOA might not just
serve as "encoder input" for the future MPEG 3D Audio standard, but
might actually be included as the full-sphere-capable (= complete) 3D
audio codec. (You can "transcode" channel-based input, like
5.1/11.1/22.2, or audio objects with positions into an Ambisonics/HOA
representation. In this way, Ambisonics/HOA could be both encoder input
and output.)
In any case, MPEG has to decide whether they want to work on a half- or
full-sphere version of 3D audio. (Ambisonics has been full-sphere since
the 70s, "as everybody knows".)
6. CfP, p. 11 - 13, "Phase 2"
I don't know why we should code 3D audio at bitrates below 128 kbit/s.
In fact, I think this has been proven to be problematic even in the
case of 5.1 surround!
http://tech.ebu.ch/docs/tech/tech3324.pdf
(Figure 2)
HE-AAC_MPS_96 (5.1) doesn't seem to work "everywhere", and
HE-AAC_MPS_64 certainly does not.
If anything, I would not assume that anything below 128 kbps stands a
chance of giving the "immersive" experience 3D audio is supposed to
deliver. See 3), "Exhibit B"...
7. An Ambisonics/HOA-based "codec" could scale from 256 kbit/s to
1.2 Mbit/s (the CfP requirements).
a) 256 kbit/s
1st-order Ambisonics: 4 channels coded in AAC at 64 kbit/s per channel.
(LFE coding needs little space, see above. 2 LFE channels could be
included, because you don't need a lot of bits...)
Classical Ambisonics might still be a good and natural way to present 3D
audio via headphones. (The challenge of representing 3D audio via
headphones is not only related to the "3D codec", though.)
When listening to 3D audio via headphones, the decoding will be done for
the ideal listening position. Therefore, any 1st-order problems related
to the sweet spot become much less important. (There are still some
sweet-spot issues related to high frequencies, even in the headphone
case.)
b) 512 kbit/s
3rd-order horizontal/1st-order vertical: 8 channels coded in AAC at
64 kbit/s each.
(If I understood Fons correctly, I should call this form of mixed-order
representation 3h1p. Right?)
c) 1.2 Mbit/s
4th-order horizontal/3rd-order vertical: 18 channels. (9 horizontal and
9 "z"-related channels, and I guess this is "4h3p", not 4h3v. See
above...)
18 * 64 kbit/s = 1152 kbit/s. (2 LFE channels can be coded in few bits,
so you can easily respect the 1.2 Mbit/s limit.)
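For what it's worth, the channel counts behind a), b) and c) follow one simple rule for mixed-order Ambisonics, as I understand the "HhPp" notation: keep all periphonic components up to the vertical order p, plus two extra horizontal-only components for each order above p. A small sketch (the formula is my reading of the notation, not something from the CfP):

```python
# Channel count for a mixed-order Ambisonics set "<h>h<p>p":
# (p + 1)^2 periphonic components, plus 2 horizontal-only
# components for every order from p + 1 up to h.

def mixed_order_channels(h, p):
    assert h >= p >= 0
    return (p + 1) ** 2 + 2 * (h - p)

for name, (h, p) in [("1h1p", (1, 1)), ("3h1p", (3, 1)), ("4h3p", (4, 3))]:
    n = mixed_order_channels(h, p)
    print(name, n, "channels,", n * 64, "kbit/s at 64 kbit/s each")
```

This reproduces the three tiers above: 4 channels/256 kbit/s, 8/512, and 18/1152.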
I think that this representation fits the 22.2 loudspeaker layout quite
well. (You have the 10 speakers in the "ear-level" plane, which are
spaced in a relatively regular fashion; 10 "ground" speakers fit the
proposed minimum number for horizontal 4th order. The upper layers (we
have a 10 - 8 - 1 layout) seem to fit some 3rd-order HOA layout; 35º is
not 45º, but relatively close. I would not get too irritated by the
three low front speakers at -15º elevation, just try 4h3p and decode
well... Otherwise, go to 4h9p, if Fons will even allow this
combination... Not sure about that! :-D )
Such a mixed-order HOA "proposed codec" would represent the HOA test
material very well (CfP, p. 19: most HOA samples are 4th order).
Therefore, the Ambisonics/HOA "codec proposal" would probably win the
RM0-HOA contest at 1.2 Mbit/s (CfP, p. 7, p. 11), and maybe the HOA
tests at 512 kbit/s. (Test 1.1 HOA, Test 1.2 HOA at 512 kb/s, Test 1.3
at 512 kb/s for headphones, Test 1.4 HOA; CfP p. 7-10.)
8. There are ideas on how to code HOA channels (or 1st-order
Ambisonics) even more efficiently.
http://ro.uow.edu.au/cgi/viewcontent.cgi?article=8025&context=engpapers
For example, you might apply some light form of SBR when encoding
1st-order Ambisonics. In this case, you might be able to code 1st order
in HE-AAC at 192 kbps, or even 160 kbps. (Try to code the frequencies up
to 16 kHz in "classical" AAC fashion, and just the very high frequency
range via SBR. This might actually work quite well. The article
describes possibly more important savings in the HOA case. For example,
it might be possible to code higher orders than 3h1p in the 512 kbit/s
case...)
9. I don't believe that the "Phase 2" requirements (coding of 3D audio
at 128 kbit/s, 96 kbit/s, 64 kbit/s and 48 kbit/s) have any "big"
real-world implications. We have seen that it is very hard to code even
5.1 (ITU) surround at these bitrates. Secondly, people stream video at
much higher bitrates, so what?!
On the other hand, they write (on p. 16) about home theater with 8K * 4K
resolution as the "main scenario". (Good luck buying a TV or projector
with this resolution at a non-hurting price. In 10 years, I might
add! :-D )
You won't deliver any "immersive" 3D audio at "Phase 2" bitrates. But at
least, don't combine immersive 8K UHD TV with immersive 3D audio at
96 kbps. Please, don't do that...
10. The "proposed" Ambisonics/HOA-based 3D audio codec(s) could be
extended in many ways.
- For example, you could add 2 or 3 direct front channels to proposal
7a, and the combined bitrate for B+ (and B++ ;-) ) would be 384 kbit/s
(or 448 kbit/s). These are typical Dolby Digital bitrates, official DVD
bitrates for DD.
- You could easily extend 4h3p to 6h3p, which is 22 full channels + 2
(heavily bandwidth-limited) LFE channels. This could be coded in
< 1.5 Mbit/s, if we stay with AAC at 64 kbit/s per channel.
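The 6h3p count can be checked by hand. Assuming the same reading of the mixed-order notation as above (full periphonic set up to the vertical order, two horizontal-only components per order beyond it), a two-line sketch:

```python
# 6h3p: 3rd-order periphonic core plus horizontal-only orders 4..6.
channels = (3 + 1) ** 2 + 2 * (6 - 3)  # 16 + 6 = 22 full channels
bitrate_kbps = channels * 64           # at 64 kbit/s per channel
print(channels, bitrate_kbps)          # 22 channels, 1408 kbit/s;
                                       # 2 cheap LFEs keep it < 1.5 Mbit/s
```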
6h3p actually seems well balanced, because we know that the angular
horizontal resolution of human hearing is much higher than the angular
vertical resolution. This might be related to simple anatomy: both ears
lie at the same height, so it is harder to "measure" vertical
differences than horizontal differences.
11. This posting is motivated by the fact that 3D audio for music/games
and VR should not be limited to some half-sphere representation;
otherwise you will miss reflections, reverb and even direct sources
which exist in real-world environments/soundfields. (The last word has
been chosen very carefully. MPEG loves "MPEG Audio Technologies". I
have my own bias. 8-) )
I actually write very much from a musician's perspective. If you would
like to record music including its natural acoustics, the whole idea of
3D audio seems to be that you would not cancel parts of the acoustics
just because they don't fit into some "22.2" scheme! And if you can't
find a practical solution for the playback of sounds/reflections/reverb
from below the ear plane, there already exists one solution. (Which you
might replace with some alternative, if you find something better
within a short time-frame...)
That said, I believe MPEG should define its "3D MPS-SAOC" codec,
because this will be a valid generalization of Auro-3D/Dolby Atmos/22.2.
(ISO/IEC 23003 contains a lot of impressive stuff. I just wanted to give
some strong reasons why you should not exaggerate the "low bitrate"
requirements for any future media standard. Available bitrates keep
rising; think of the fiber and LTE data rates which are possible
today... Therefore, go for the high quality and "immersive" impression
true 3D audio < should > deliver. If not, the customer/"consumer" might
stay with 5.1 anyway. Why should he/she invest in all these additional
loudspeakers/receivers/audio cards if even the best codec got
bit-starved... :-[ And immersive might also imply full-sphere. It might
be very noticeable whether a presented acoustical scene is < real > or
"band-limited"/direction-limited.)
If the (postulated) 3D MPS-SAOC is based on a 22.2 structure, you will
stay half-sphere. We can and should do better!
Best regards
Stefan Schreiber
_______________________________________________
Sursound mailing list
[email protected]
https://mail.music.vt.edu/mailman/listinfo/sursound