Dear colleagues,

MPEG has issued a Call for Proposals (CfP) for MPEG-H Part 3 (3D Audio), also known as ISO/IEC 23008-3. (ISO/IEC 23008-2 is HEVC.)


Several documents referring to the (still early) standardization process can be found on Leonardo Chiariglione's MPEG page:

http://mpeg.chiariglione.org/
You will find the press release from MPEG 103 (the last meeting) near the top of the page. At the end of the press release you will find three documents related to MPEG-H 3D Audio.


In this posting I try to give some information about the standardization process for 3D audio, within its wider context. Secondly, I would like to offer a few public comments about the present state of 3D audio in the cinema, and some challenges/problems MPEG/ISO will have to face. (And maybe even solve. :-) ) Thirdly, nobody has to read every document I cite. (Some of it is pretty technical stuff.)

Reading the MPEG documents (especially the "Call for Proposals" document), it is obvious that they took the NHK/Hamasaki 22.2 system as the reference for a high-end home theater (or cinema) audio system.

A technical description of the "22.2 Multichannel Sound System for Ultra High-Definition TV", presented by its direct proponents:

http://www.nhk.or.jp/digital/en/technical/pdf/IBC2007_08040907.pdf


The system is backward-compatible with 5.1; see (for example) fig. 4.

If I may offer some criticism: the results in fig. 6 look slightly disappointing when comparing the subjective evaluations of 22.2 and 5.1.

Mixing of music must be discussed from both acoustic and artistic viewpoints. Reproduction of the ambience of a concert hall in musical shows can be easily discussed from both viewpoints. However, mixing of pop music cannot be easily discussed from the viewpoint of acoustics or physics, and the sound design of TV programs using three-dimensional sound systems very much depends on personal artistic taste and the content of the program.


My interpretation would be that the recording and mixing of real or concert-hall acoustics was/is not < that > easy, and very probably they didn't make full use of the possibilities of the 22.2 system. (The "3D" aspect alone should probably produce a clearer difference from "conventional" 5.1 surround than they measured.)

We have to be fair: The listening tests were based on early works on the 22.2 system.

We also thank the staff of the NHK production operations center for their contribution in producing the demonstration content for the 22.2 multichannel sound system at World EXPO 2005 in Aichi, Japan.



Now, it seems that MPEG already has a certain plan for how 3D audio "should" be coded. The technology "should re-use existing MPEG technology wherever possible" (CfP, p. 5, at the very top).

Without giving detailed reasons, I will just state that this technology is Spatial Audio Coding, as specified in MPEG-D (ISO/IEC 23003).

http://mpeg.chiariglione.org/standards/mpeg-d

(Obviously AAC, HE-AAC, SBR, parametric coding etc. are included in this framework.)

A defined MPS codec already exists and has been implemented in decoders.

Two (unfortunately older) articles about this stuff:

a)

http://infoscience.epfl.ch/record/54892/files/SPACE_AES119_v9.pdf

(MPEG surround, MPS)

b)

http://www.jeroenbreebaart.com/papers/aes/aes124.pdf

(SAOC)

Since an MPEG Surround (MPS) decoder serves as final rendering engine, the task of transcoding consists in combining SAOC parameters and rendering information associated with each audio object to a standards compliant MPS bitstream.


3.3.3. Binaural Transcoding
Headphones as reproduction device have gained significant
interest during the last decade.



Accordingly, the intended (?) MPEG approach might be based on a modified MPS-SAOC approach, introducing the missing "3D" elements into MPS. (SAOC already seems to be 3D-aware, if I am not misled by some sources. A combination of a channel- and an object-based approach is required by the CfP; see p. 19, Test Material.)

The role of HOA seems to be restricted to that of an encoder input format.

Now, it seems to me that there are some clear omissions and problems in the MPEG approach.


1. Why don't/won't you allow direct, channel-based encoding of NHK/Hamasaki 22.2 and Auro-3D 11.1 via AAC?

If we take 64 kbit/s per channel as the base bitrate, 22.2 can be coded at 1.536 Mbit/s and 11.1 at 768 kbit/s. Both are typical DTS bitrates, so hardly "excessive".

(Note that the two LFE channels can be coded very efficiently, because they only contain frequencies from about 3 Hz to 120 Hz. So you can take a range of 0 - 200 Hz, or whatever other upper frequency limit you prefer, and code just that. LFE channels can be coded in very few bits, because the LFE band spans less than 1% of the audible frequency range.)
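To make the arithmetic above concrete, here is a minimal sketch. The 64 kbit/s per full-range channel figure is the one used above; the 8 kbit/s LFE budget is my own assumption, justified only by the narrow LFE bandwidth:

```python
# Back-of-the-envelope AAC bitrates for channel-based layouts.
# 64 kbit/s per full-range channel comes from the text; the 8 kbit/s
# LFE budget is an assumption for a heavily band-limited channel.

FULL_RATE_KBPS = 64
LFE_RATE_KBPS = 8  # assumed

def layout_bitrate(full_channels: int, lfe_channels: int) -> int:
    """Total bitrate in kbit/s for a channel-based layout."""
    return full_channels * FULL_RATE_KBPS + lfe_channels * LFE_RATE_KBPS

print(layout_bitrate(22, 2))  # 22.2 (NHK/Hamasaki)
print(layout_bitrate(11, 1))  # 11.1 (Auro-3D)
```

(Treating all channels as full-range, as in the 1.536 Mbit/s figure above, just gives the upper bound 24 * 64 = 1536 kbit/s.)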


2. If the aim is to code 3D audio, I see a very fundamental problem in any "3D MPS-SAOC" plus 22.2 "reference" channel-based approach: the combination is clearly a half-sphere solution, not full-sphere!

I know that it is hard to install loudspeakers below the listeners' or spectators' feet in a cinema, theatre or opera building. But this doesn't mean there are no reflections, reverb or even direct sources from < down > directions, if you want to represent natural acoustical environments.

If we talk about headphone-based representation of 3D audio, nothing prevents us from decoding sounds which come from the lower hemisphere. I could even give some good reasons why the lower hemisphere is < more > important for 3D audio than the upper hemisphere. In cathedral acoustics, floor reflections are IMO about as important as ceiling reflections, and 3D reverb will have < up > and < down > components. In concert hall acoustics, early reflections tend to come from the sides, and more from the floor than from the ceiling. In open-air acoustics, floor reflections are important, but there is no ceiling. So in "open" acoustics (think even of performance spaces like amphitheatres), the lower hemisphere is probably more important for 3D audio than the upper one. (OK: in the film sound case, Rambo might be attacked by a helicopter, clearly an upper-hemisphere event. But think of the ground reflections, still important... ;-) )

Three-dimensional, immersive sound is not only about discrete, individual sounds that originate from around and above the listener (such as birds, airplanes, etc.), but more important is the ability of the sound system to deliver the sensation of being immersed by sound that originates from the combination of both Lower and Height channels - accurately reproducing the ambient sounds such as concert hall reflections, the sound of leaves rustling in the forest or the ambience of a busy market square.


http://www.barco.com/en/Auro11-1/Auro_11-1_explained/~/media/48FD91C7F9574A4280E49AA8C4CCA90E.pdf

(van Baelen, etc.)

Because 11.1 (Auro-3D) is also an "upper-hemisphere" system and not full-sphere, the "sound of leaves rustling in the forest" can't really be reproduced. Q.e.d.


If we are talking about games, I believe we would have more direct sound from below than from above. (Your own footsteps. The cracking branches in the woods, if you think of playing a very violent shooter game... but you just have to defend your country, your family... The engine and transmission sounds of your virtual Formula 1 car...)


The requirements for 3D audio in cinema/HT and in music/games/VR (virtual reality) might actually be quite different, at least in practice. In the first case, it is hard to install speakers at elevations of, say, below -20º. In the "progressive 3D film sound" cinema case, the 3D proponents will speak of the need for "height", but natural soundfields and real 3D audio would also have to cover the lower hemisphere. (Direct sources, reflections, and the reverb components which come from below.)

I see two solutions: include Ambisonics/HOA as a permitted codec in the MPEG-H 3D Audio framework (even 1st-order Ambisonics is full-sphere 3D audio), OR include some virtual low speakers in the "3D MPS-SAOC" framework.

You can also deny the problem and say that ISO/IEC 23008-3 is for cinema/film audio, and nothing else. However, my impression is that "we" originally wanted a general solution for 3D audio. And in any case, unsolved issues will be covered by the competition.


3. I certainly believe that a parametric/object-based approach might be the best choice for very low (and maybe low) bitrates, but things might never get perfect at any bitrate!

Exhibit A (parametric coding)

http://en.wikipedia.org/wiki/Parametric_stereo

Because only one audio channel is transmitted, along with the parametric side info, a 24 kbit/s coded audio signal with Parametric Stereo will be substantially improved in quality, compared to a discretely stereo coded audio signal coded with conventional means. Thus, the additional bitrate spent on the single mono channel (combined with some PS side info) will improve the perceived quality substantially of the audio compared to a standard stereo stream at similar bitrate.

However, this technique is only useful at the lowest bitrates (say 16 - 32 kbit/s, sweet-spot at 24 kbit/s) to give a good stereo impression, so while it can improve perceived quality at very low bitrates, it generally does not achieve transparency <http://en.wikipedia.org/wiki/Transparency_%28data_compression%29>, since simulating the stereo dynamics of the audio with the technique is limited and generally deteriorates perceived quality regardless of the bitrate.


I admit that this is the most extreme case, because < any > directional information would have to be described in parametric form.

Exhibit B

http://tech.ebu.ch/docs/tech/tech3324.pdf

Looking at the results in fig. 2, it is hard to claim that HE-AAC_MPS_64 (MPS: MPEG Surround) or even HE-AAC_MPS_96 is good enough for broadcast use, because the reproduction of at least < some > samples was rated poor. Looking at fig. 4, HE-AAC "5.1" also had problems at every tested bitrate, and the same holds for DD+ up to 256 kbit/s. It is interesting that L2_MPS_256 seems to land in the good area, but nowhere excellent. Because that bitrate should be sufficient for very good results when coding stereo in MPEG-1 Audio Layer II, the L2_MPS_256 example just proves that you can't code "the rest" (= the surround elements) in purely parametric fashion.

It can be concluded that, at the moment, the MPEG HE-AAC seems to be the most favourable choice for a broadcaster requiring a good scalability of bitrate versus quality, down to relatively low bit rates.

The new coding systems with parametric coding of the surround information of the various audio channels, such as MPEG Surround, show a rather unbalanced behaviour which depends on the type of the test sequence. For the MPEG Surround codecs "applause" is again the most critical sequence resulting in only "Fair", or sometimes even "Poor" audio quality. It can be concluded that the MPEG Surround codecs at the moment do not fulfil the requirements for high quality broadcasting at their target bit rate, or do not offer the expected advantage in terms of bitrate gain compared to already well-established codecs.


4. Audio objects

I have long argued that audio objects don't solve < every > problem, because they won't sound really real. :-)

You can certainly route an audio object to some 3D position. But you won't get the reflections and reverb of this virtual source. At least I don't know of any decoder which would calculate these things for a given acoustic space... (They would have to be described via metadata, room or "space" impulse responses, etc. It is not very realistic to include this in some 3D audio home receiver...)

The practical solution could be to split direct audio objects from their reflections/reverb, the latter being transmitted in the < channel > base. However, the direct and the reflected/reverberated sound of an audio object might easily become detached. Even worse, this actually has to happen in some circumstances. (Audio objects are rendered as "direct elements" to the available individual speaker positions, whereas their reflections/reverb stay in the 5.1/22.2 channel base. Note that a cinema certainly has more than 6 speakers, if we talk about any real-world 5.1 installation.)

If working in an "HOA studio environment" (which doesn't really exist yet, but anyway), you would code the audio objects and their reflections/reverb into the soundfield.

In any case, I found another opinion which seems to back my position. (I didn't "copy" these views; I found this passage by accident while reading about Auro-3D...)

http://www.barco.com/en/Auro11-1/Auro_11-1_explained/~/media/48FD91C7F9574A4280E49AA8C4CCA90E.pdf

Based on the principles described above, it can be understood that the Object-based approach is especially powerful for the reproduction of discrete sound sources that require precise localization as well as movement. Examples are flying bullets, cars passing by, a knock on a door... However, this ... works best if these sources are recorded 'dry', without reverberation or reflections.

In reality, however, such distinct, dry sources are the exception rather than the rule; they are almost always accompanied by 'wet' or ambient sounds coming from the object's surroundings.


That is also the reason why some systems apply a hybrid approach where ambient sounds as well as music are Channel-based, while some discrete sounds are then handled as objects.

In these cases, the audio objects are often recorded dry, while the necessary reflections and reverberation to make the sound life-like are added as pre-mixed channels.


But within some form of "3D MPS-SAOC", the pre-mixed channels containing the "wet" indirect elements of the audio objects will be coded in some parametric way, and will this work? It won't, because the dry and wet parts of an audio object (staying in this terminology) will become detached... A parametric description of reflections/reverb just won't keep exact positions. At least the first reflections of audio objects would require precise positions, I would assume.



5. I allow myself to < suggest > that Ambisonics/HOA might not just serve as "encoder input" for the future MPEG 3D audio standard, but might actually be included as the full-sphere-capable (= complete) 3D audio codec. (You can "transcode" channel-based input, like 5.1/11.1/22.2, or audio objects with positions, into an Ambisonics/HOA representation. In this way, Ambisonics/HOA could be both encoder input and output.)
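As a toy sketch of what such a "transcode to Ambisonics" step looks like for a single audio object, here is a first-order B-format panner using the classic textbook gains. This is purely my own illustration, not anything from the CfP:

```python
import math

def pan_first_order(sample: float, azimuth: float, elevation: float):
    """Encode a mono sample into first-order B-format (W, X, Y, Z).

    Classic gains: W is the omni component (scaled by 1/sqrt(2),
    Furse-Malham convention), X/Y/Z are the figure-of-eight components.
    Angles in radians; azimuth 0 = front, positive elevation = up.
    """
    w = sample / math.sqrt(2.0)
    x = sample * math.cos(azimuth) * math.cos(elevation)
    y = sample * math.sin(azimuth) * math.cos(elevation)
    z = sample * math.sin(elevation)
    return (w, x, y, z)

# A source straight below the listener (elevation -90 degrees) lands
# entirely in -Z: the representation is full-sphere by construction.
print(pan_first_order(1.0, 0.0, -math.pi / 2))
```

Note that the lower hemisphere costs nothing extra here; whether the decoder and speaker layout can reproduce it is a separate question.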

In any case, MPEG has to decide whether they want to work on a half-sphere or a full-sphere version of 3D audio. (Ambisonics has been full-sphere since the 70s, "as everybody knows".)


6. CfP, p. 11 - 13, "Phase 2"

I don't know why we should code 3D audio at bitrates below 128 kbit/s. In fact, I think this has been proven to be problematic even in the case of 5.1 surround!

http://tech.ebu.ch/docs/tech/tech3324.pdf

(Figure 2)

HEAAC_MPS_96 (5.1) doesn't seem to work "everywhere", and HEAAC_MPS_64 certainly doesn't.

I would not assume that anything below 128 kbit/s stands a chance of delivering the "immersive" experience 3D audio is supposed to provide.

see 3), "Exhibit B"...


7. An Ambisonics/HOA-based "codec" could scale from 256 kbit/s to 1.2 Mbit/s (the range required by the CfP).

a) 256 kbit/s

1st-order Ambisonics: 4 channels coded in AAC at 64 kbit/s per channel.

(LFE coding needs little space, see above. 2 LFE channels could be included, because you don't need a lot of bits...)

Classical Ambisonics might still be a good and natural way to present 3D audio via headphones. (The challenge to represent 3D audio via headphones is not only related to the "3D codec", though.)

If listening to 3D audio via headphones, the decoding will be done for the ideal listening position. Therefore, the 1st-order problems related to the sweet spot become much less important. (There are still some sweet-spot issues related to high frequencies, even in the headphone case.)

b) 512 kbit/s

3rd-order horizontal/1st-order vertical: 8 channels coded in AAC at 64 kbit/s each.

(If I understood Fons correctly, I should call this form of mixed-order representation 3h1p. Right?)

c) 1.2 MBit/s

4th-order horizontal/3rd-order vertical: 18 channels. (9 horizontal and 9 "z"-related channels; and I guess this is "4h3p", not "4h3v". See above...)

18 * 64 kbit/s = 1152 kbit/s. (The 2 LFE channels can be coded in few bits, so you can easily respect the 1.2 Mbit/s limit.)

I think that this representation fits the 22.2 loudspeaker layout quite well. (You have the 10 speakers in the "ear-level" plane, which are spaced in a relatively regular fashion; 10 ear-level speakers match the proposed minimum speaker count for horizontal 4th order. The two upper layers - we have a 10 - 8 - 1 layout - seem to fit some 3rd-order HOA layout; 35º is not 45º, but relatively close. I would not get too irritated by the three low front speakers at -15º elevation; just try 4h3p and decode well... Otherwise, go to 4h9p, if Fons will even allow this combination... Not sure about that! :-D )

Such a mixed-order HOA "proposed codec" would represent the HOA test material very well (CfP, p. 19: most HOA samples are 4th order). Therefore, the Ambisonics/HOA "codec proposal" would probably win the RM0-HOA contest at 1.2 Mbit/s (CfP, p. 7, p. 11), and maybe the HOA tests at 512 kbit/s. (Test 1.1 HOA, Test 1.2 HOA at 512 kbit/s, Test 1.3 at 512 kbit/s for headphones, Test 1.4 HOA; CfP p. 7-10)
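The channel counts in 7a)-7c) follow from the standard formulas, assuming the "XhYp" naming used above (a full 3D set of vertical order p, plus horizontal-only components up to order h). A small sketch:

```python
def mixed_order_channels(h: int, p: int) -> int:
    """Channels in a mixed-order set 'XhYp' with h >= p:
    a full 3D set of order p has (p + 1)**2 components, and each
    horizontal order from p + 1 up to h adds two more."""
    if h < p:
        raise ValueError("horizontal order must be >= vertical order")
    return (p + 1) ** 2 + 2 * (h - p)

def hoa_bitrate_kbps(h: int, p: int, per_channel: int = 64) -> int:
    """AAC bitrate for the set at 64 kbit/s per channel, as above."""
    return mixed_order_channels(h, p) * per_channel

for name, (h, p) in {"1h1p": (1, 1), "3h1p": (3, 1),
                     "4h3p": (4, 3), "6h3p": (6, 3)}.items():
    print(name, mixed_order_channels(h, p), hoa_bitrate_kbps(h, p))
```

This reproduces the figures above: 4 channels/256 kbit/s, 8/512, 18/1152, plus 22 channels (under 1.5 Mbit/s including two cheap LFE channels) for the 6h3p extension mentioned in point 10.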


8. There are ideas how to code HOA channels (or 1st order Ambisonics) even more efficiently.

http://ro.uow.edu.au/cgi/viewcontent.cgi?article=8025&context=engpapers

For example, you might apply some light form of SBR when encoding 1st-order Ambisonics. In this case, you might be able to code 1st order in HE-AAC at 192 kbit/s, or even 160 kbit/s. (Code the frequencies up to 16 kHz in "classical" AAC fashion, and just the very high frequency range via SBR. This might actually work quite well. The article describes possibly more important savings in the HOA case; for example, it might be possible to code higher orders than 3h1p in the 512 kbit/s case...)


9. I don't believe that the "Phase 2" requirements (coding of 3D audio at 128 kbit/s, 96 kbit/s, 64 kbit/s and 48 kbit/s) have any "big" real-world implications. We have seen that it is very hard to code even 5.1 (ITU) surround at these bitrates. Secondly, people stream video at much higher bitrates, so what?!

On the other hand, they write (on p. 16) about home theater with 8k * 4k resolution as the "main scenario". (Good luck buying a TV or projector with this resolution at a non-hurting price - in 10 years, I might add! :-D )

You won't deliver any "immersive" 3D audio at "Phase 2" bitrates. But at least, don't combine immersive 8K UHD TV with "immersive" 3D audio at 96 kbit/s. Please, don't do that...


10. The "proposed" Ambisonics/HOA based 3D-audio codec(s) could be extended, in many ways.

- For example, you could add 2 or 3 direct front channels to proposal 7a, and the combined bitrate for B+ (and B++ ;-) ) would be 384 kbit/s (or 448 kbit/s). These are typical bitrates for Dolby Digital, i.e. official DVD bitrates for DD.

- You could easily extend 4h3p to 6h3p, which is 22 full channels + 2 (heavily bandwidth-limited) LFE channels. This could be coded in under 1.5 Mbit/s, if we stay with AAC at 64 kbit/s per channel. 6h3p actually seems well balanced, because we know that the horizontal angular resolution of human hearing is much higher than the vertical angular resolution. This might come down to simple anatomy: both ears lie on the same horizontal axis, so it is harder to "measure" vertical differences than horizontal ones.


11. This posting is motivated by the fact that 3D audio for music/games and VR should not be limited to a half-sphere representation; otherwise you will miss reflections, reverb and even direct sources which exist in real-world environments/soundfields. (The last word has been chosen very carefully. MPEG loves "MPEG Audio Technologies". I have my own bias. 8-) )


I actually write very much from a musician's perspective. If you want to record music including natural acoustics, the whole idea of 3D audio seems to be that you would not cancel out parts of the acoustics just because they don't fit into some "22.2" scheme! And if you can't find a practical solution for the playback of sounds/reflections/reverb from below the ear plane, one solution already exists. (Which you might replace with some alternative, if you find something better within a short time frame...)

Having said this, I believe MPEG should define its "3D MPS-SAOC" codec, because this will be a valid generalization of Auro-3D/Dolby Atmos/22.2. (ISO/IEC 23003 contains a lot of impressive stuff. I just wanted to give some strong reasons why you should not exaggerate the "low bitrate" requirements for any future media standard. Available bitrates keep rising; think of the fiber and LTE data rates which are possible today... Therefore, go for the high quality and "immersive" impression true 3D audio < should > deliver. If not, the customer/"consumer" might stay with 5.1 anyway. Why should he/she invest in all these additional loudspeakers/receivers/audio cards if even the best codec got bit-starved... :-[ And immersive might also imply full-sphere. It might be very noticeable whether some presented acoustical scene is < real > or "band-limited"/direction-limited.)

If the (postulated) 3D MPS-SAOC is based on a 22.2 structure, you will stay half-sphere. We can and should do better!


Best regards

Stefan Schreiber









_______________________________________________
Sursound mailing list
[email protected]
https://mail.music.vt.edu/mailman/listinfo/sursound
