Dear colleagues,
MPEG has issued a CfP (Call for Proposals) for MPEG-H part 3 (3D
Audio), also known as ISO/IEC 23008-3. (ISO/IEC 23008-2 is HEVC.)
Several documents referring to the (still early) standardization process
can be found on Leonardo Chiariglione's MPEG page:
http://mpeg.chiariglione.org/
The press release from MPEG 103 (the last meeting) can be found near
the top of the page. At its end you will find three documents related
to MPEG-H 3D Audio.
In this posting I try to give some information about the standardization
process for 3D audio, within its wider context. Secondly, I would like
to offer a few public comments about the present state of 3D Audio in
the cinema, and some challenges/problems MPEG/ISO will have to face.
(And maybe even solve. :-) )
Thirdly, nobody has to read every document I will cite. (Some of the
cited material is pretty technical.)
Reading the MPEG documents (especially the "Call for Proposals"
document), it is obvious that they took the NHK/Hamasaki 22.2 system as
the reference for a high-end home theater audio system (or cinema audio).
A technical description of the "22.2 Multichannel Sound System for
Ultra High-Definition TV", presented by its direct proponents:
http://www.nhk.or.jp/digital/en/technical/pdf/IBC2007_08040907.pdf
The system is backward-compatible to 5.1, see (for example) fig. 4.
If I may offer some criticism: the results in fig. 6 look slightly
disappointing when comparing the subjective evaluations of 22.2 and 5.1.
Mixing of music must be discussed from both
acoustic and artistic viewpoints. Reproduction of the
ambience of a concert hall in musical shows can be
easily discussed from both viewpoints. However,
mixing of pop music cannot be easily discussed
from the viewpoint of acoustics or physics, and the
sound design of TV programs using
three-dimensional sound systems very much
depends on personal artistic taste and the content of
the program.
My interpretation would be that the recording and mixing of real
concert hall acoustics was/is not < that > easy, and very probably they
didn't make full use of the possibilities of the 22.2 system. (The
"3D" aspect alone should probably provide a clearer difference from
"conventional" 5.1 surround than they measured.)
We have to be fair: The listening tests were based on early works on the
22.2 system.
We also thank the
staff of the NHK production operations center for their contribution
in producing the demonstration
content for the 22.2 multichannel sound system at World EXPO 2005 in
Aichi, Japan.
Now, it seems to be the case that MPEG already has a certain plan for
how 3D Audio "should" be coded. The technology "should re-use existing
MPEG technology wherever possible" (CfP, p. 5, at the very top).
Without giving detailed reasons, I will just state that this technology
is Spatial Audio Coding, according to MPEG-D or ISO/IEC 23003.
http://mpeg.chiariglione.org/standards/mpeg-d
(Obviously AAC, HE-AAC, SBR, parametric coding etc. are included in
this framework.)
There already exists a defined MPS codec, implemented in decoders.
Two (unfortunately older) articles about this stuff:
a)
http://infoscience.epfl.ch/record/54892/files/SPACE_AES119_v9.pdf
(MPEG surround, MPS)
b)
http://www.jeroenbreebaart.com/papers/aes/aes124.pdf
(SAOC)
Since an MPEG Surround (MPS) decoder serves as final
rendering engine, the task of transcoding consists in
combining SAOC parameters and rendering information
associated with each audio object to a standards compliant
MPS bitstream.
3.3.3. Binaural Transcoding
Headphones as reproduction device have gained significant
interest during the last decade.
Accordingly, the intended (?) MPEG approach might be based on a modified
MPS-SAOC approach, introducing the missing "3D" elements into MPS. (SAOC
already seems to be 3D-aware, if I am not misled by some sources. A
combination of a channel- and object-based approach is required by the
CfP; see p. 19, Test Material.)
The role of HOA seems to be restricted to that of an encoder input format.
Now, it seems to me that there are some clear omissions and problems in
the MPEG approach.
1. Why don't/won't you allow direct, channel-based encoding of
NHK/Hamasaki 22.2 and Auro-3D 11.1 via AAC?
If we take 64 kbit/s per channel as the base rate, 22.2 can be coded in
1.536 Mbit/s and 11.1 in 768 kbit/s. Both are known as typical DTS
bitrates, so hardly "excessive".
(Note that the two LFE channels can be coded very efficiently, because
they contain only frequencies from 3 Hz to 120 Hz. So you can take a
range from 0 Hz to 200 Hz, or whatever other upper frequency limit you
prefer, and code just this. LFE channels can be coded in very few bits,
because the LFE band spans less than 1% of the audible frequency range.)
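To illustrate the arithmetic above, here is a minimal sketch in Python. The 64 kbit/s per-channel figure comes from this posting; the 8 kbit/s LFE rate is my own hypothetical placeholder for "very few bits":

```python
# Rough total bitrates for direct channel-based AAC coding.
# per_channel_kbps = 64 follows the text; lfe_kbps = 8 is a
# hypothetical placeholder rate for a band-limited LFE channel.

def total_bitrate_kbps(full_channels, lfe_channels=0,
                       per_channel_kbps=64, lfe_kbps=8):
    """Total bitrate in kbit/s for a channel-based layout."""
    return full_channels * per_channel_kbps + lfe_channels * lfe_kbps

print(total_bitrate_kbps(24))     # 22.2 as 24 full channels -> 1536 kbit/s
print(total_bitrate_kbps(12))     # 11.1 as 12 full channels -> 768 kbit/s
print(total_bitrate_kbps(22, 2))  # 22.2 with cheap LFEs     -> 1424 kbit/s
```

The last line just shows that coding the two LFEs cheaply only lowers the total further.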
2. If the aim is to code 3D audio, I see a very fundamental problem in
any "3D MPS-SAOC" plus 22.2-"reference" channel-based approach: the
combination is clearly a half-sphere solution, not full-sphere!
I know that it is hard to install loudspeakers below the listeners' or
spectators' feet in a cinema, theatre or opera building. But this doesn't
mean there are no reflections, reverb or even direct sources from
< down > directions, if you want to represent natural acoustical
environments.
If we talk about headphone-based representation of 3D audio, nothing
prevents decoding sounds which come from the lower hemisphere. I could
even give some good reasons why the lower hemisphere is < more >
important for 3D audio than the upper hemisphere.
In cathedral acoustics, floor reflections are IMO about as important
as ceiling reflections. 3D reverb will have < up > and < down >
components. In concert hall acoustics, early reflections tend to come
from the sides, and more from the floor than from the ceiling.
In open-air acoustics, floor reflections are important, but there is no
ceiling. So in "open" acoustics (think even of performance venues like
amphitheatres), the lower hemisphere is probably more important for
3D audio than the upper one. (OK: in the film sound case, Rambo might be
attacked by a helicopter, clearly an upper-hemisphere event. But think
of the ground reflections, still important... ;-) )
Three-dimensional, immersive sound is not only about
discrete, individual sounds that originate from around and
above the listener (such as birds, airplanes, etc.), but more
important is the ability of the sound system to deliver the
sensation of being immersed by sound that originates from the
combination of both Lower and Height channels – accurately
reproducing the ambient sounds such as concert hall
reflections, the sound of leaves rustling in the forest or the
ambience of a busy market square.
http://www.barco.com/en/Auro11-1/Auro_11-1_explained/~/media/48FD91C7F9574A4280E49AA8C4CCA90E.pdf
(van Baelen, etc.)
Because 11.1 (Auro-3D) is also an "upper-hemisphere system" but not
full-sphere, the "sound of leaves rustling in the forest" can't be
reproduced. Q.e.d.
If we talk about games, I believe we would have more direct sound from
below than from above. (Your footsteps. The cracking branches in the
woods, think of playing a very violent shooter game... but you just
have to defend your country, family... The engine and transmission
sounds in your virtual Formula 1 car...)
The requirements for 3D audio in cinema/HT and in music/games/VR
(virtual reality) might actually be quite different, at least in
practice. In the first case, it is hard to install speakers at
elevations of, say, "below -20º". In the cinema "progressive 3D film
sound" case, the 3D proponents will speak of the need for "height", but
natural soundfields and real 3D audio would also have to cover the
lower hemisphere. (Direct sources, reflections, and the reverb elements
which come from below.)
I see two solutions: include Ambisonics/HOA as a permitted codec in the
MPEG-H 3D Audio framework (even 1st-order Ambisonics is full-sphere 3D
audio), OR include some virtual low speakers in the "3D MPS-SAOC"
framework.
You could also deny the problem and say that ISO/IEC 23008-3 is for
cinema/film audio, and nothing else. However, my impression is that "we"
originally wanted a general solution for 3D audio. And in any case,
unsolved issues will be picked up by the competition.
3. I certainly believe that a parametric/object-based approach might be
best for very low (and maybe low) bitrates, but things might never get
perfect at any bitrate!
Exhibit A (parametric coding)
http://en.wikipedia.org/wiki/Parametric_stereo
Because only one audio channel is transmitted, along with the
parametric side info, a 24 kbit/s coded audio signal with Parametric
Stereo will be substantially improved in quality, compared to a
discretely stereo coded audio signal coded with conventional means.
Thus, the additional bitrate spent on the single mono channel
(combined with some PS side info) will improve the perceived quality
substantially of the audio compared to a standard stereo stream at
similar bitrate.
However, this technique is only useful at the lowest bitrates (say 16
- 32 kbit/s, sweet-spot at 24 kbit/s) to give a good stereo
impression, so while it can improve perceived quality at very low
bitrates, it generally does not achieve transparency
<http://en.wikipedia.org/wiki/Transparency_%28data_compression%29>,
since simulating the stereo dynamics of the audio with the technique
is limited and generally deteriorates perceived quality regardless of
the bitrate.
I admit that this is the most extreme case, because < any > directional
information would have to be described in parametric form.
Exhibit B
http://tech.ebu.ch/docs/tech/tech3324.pdf
Looking at the results in fig. 2, it is hard to claim that HE-AAC_MPS_64
(MPS: MPEG Surround) or even HE-AAC_MPS_96 are good enough to be used
for broadcast, because the reproduction of at least < some > samples has
been rated as poor.
Looking at fig. 4, HE-AAC "5.1" also had problems at any given bitrate.
The same holds for DD+ up to 256 kbps. It is interesting that L2_MPS_256
seems to be in the good area, but nowhere excellent. Because that
bitrate should be sufficient for very good results when coding stereo in
MPEG-1 Audio Layer II, the L2_MPS_256 example just proves that you can't
code "the rest" (= the surround elements) in a purely parametric fashion.
It can be concluded that, at the moment, the MPEG HE-AAC seems to be
the most favourable
choice for a broadcaster requiring a good scalability of bitrate
versus quality, down to relatively
low bit rates.
The new coding systems with parametric coding of the surround
information of the various audio
channels, such as MPEG Surround, show a rather unbalanced behaviour
which depends on the type
of the test sequence. For the MPEG Surround codecs "applause" is again
the most critical sequence
resulting in only "Fair", or sometimes even "Poor" audio quality. It
can be concluded that the MPEG
Surround codecs at the moment do not fulfil the requirements for high
quality broadcasting at their
target bit rate, or do not offer the expected advantage in terms of
bitrate gain compared to
already well-established codecs.
4. Audio objects
I have long argued that audio objects don't solve < every > problem,
because they won't sound really real. :-)
You can certainly route an audio object to some 3D position. But you
won't get the reflections and reverb of this virtual source. At least I
don't know of any decoder which would calculate these things for a
given acoustic... (They would have to be described via metadata, room or
"space" impulse responses, etc. Not very realistic to include this in
some 3D audio home receiver...)
The practical solution could be to split direct audio objects from their
reflections/reverb, the latter being transmitted in the < channel >
base. However, the direct and reflected/reverberated sound of an audio
object might easily become detached. Even worse, this actually has to
happen in some circumstances. (Audio objects are rendered as "direct
elements" to available individual speaker positions, whereas their
reflections/reverb would stay in the 5.1/22.2 channel base. Note that a
cinema certainly has more than 6 speakers, if we talk about any
real-world 5.1 installation.)
If working in an "HOA studio environment" (which doesn't really exist,
but anyway), you would code audio objects and their reflections/reverb
into the soundfield.
In any case, I found another opinion which seems to back my position.
(I didn't actually "copy" these views; I found this part by accident
while reading about Auro-3D...)
http://www.barco.com/en/Auro11-1/Auro_11-1_explained/~/media/48FD91C7F9574A4280E49AA8C4CCA90E.pdf
Based on the principles described above, it can be understood
that the Object-based approach is especially powerful for the
reproduction of discrete sound sources that require precise
localization as well as movement. Examples are flying bullets,
cars passing by, a knock on a door… However, this ... works
best if these sources are recorded ʻdryʼ, without reverberation
or reflections.
In reality, however, such distinct, dry sources are the
exception rather than the rule; they are almost always
accompanied by ʻwetʼ or ambient sounds coming from the
objectʼs surroundings.
That is also the reason
why some systems apply a hybrid approach where ambient
sounds as well as music are Channel-based, while some
discrete sounds are then handled as objects.
In these cases, the audio objects are often recorded dry, while
the necessary reflections and reverberation to make the sound
life-like are added as pre-mixed channels.
But within some form of "3D MPS-SAOC", the pre-mixed channels containing
the "wet" indirect elements of any audio object will be coded in some
parametric way, and will this work? It won't, because the dry and wet
parts of an audio object (staying in this terminology) will become
detached... A parametric description of reflections/reverb just won't
keep exact positions. At least the first reflections of audio objects
would need precise positions, I would assume.
5. I allow myself to < suggest > that Ambisonics/HOA might not just
serve as "encoder input" for the future MPEG 3D Audio standard, but
might actually be included as the full-sphere-capable (= complete) 3D
audio codec. (You can "transcode" channel-based input, like
5.1/11.1/22.2, or audio objects with positions into an Ambisonics/HOA
representation. In this way, Ambisonics/HOA could be both encoder input
and output.)
In any case, MPEG has to decide whether they want to work on a half- or
full-sphere version of 3D audio. (Ambisonics has been full-sphere since
the 70s, "as everybody knows".)
6. CfP, p. 11 - 13, "Phase 2"
I don't know why we should code 3D audio at bitrates below 128 kbit/s.
In fact, I think this has been proven to be problematic even in the
case of 5.1 surround!
http://tech.ebu.ch/docs/tech/tech3324.pdf
(Figure 2)
HE-AAC_MPS_96 (5.1) doesn't seem to work "everywhere", and
HE-AAC_MPS_64 certainly does not.
If anything, I would not assume that anything below 128 kbps stands a
chance of giving the "immersive" experience 3D audio is supposed to
deliver. See 3), "Exhibit B"...
7. An Ambisonics/HOA-based "codec" could scale from 256 kbit/s to
1.2 Mbit/s (the CfP requirements).
a) 256 kbit/s
1st-order Ambisonics: 4 channels coded in AAC at 64 kbit/s per channel.
(LFE coding needs little space, see above. 2 LFE channels could be
included, because you don't need a lot of bits...)
Classical Ambisonics might still be a good and natural way to present 3D
audio via headphones. (The challenge of representing 3D audio via
headphones is not only related to the "3D codec", though.)
When listening to 3D audio via headphones, the decoding will be done for
the ideal listening position. Therefore, any 1st-order problems related
to the sweet spot become much less important. (There are still some
sweet-spot issues related to high frequencies, even in the headphone
case.)
b) 512 kbit/s
3rd-order horizontal/1st-order vertical: 8 channels coded in AAC at
64 kbit/s each.
(If I understood Fons correctly, I should call this form of mixed-order
representation 3h1p. Right?)
c) 1.2 Mbit/s
4th-order horizontal/3rd-order vertical: 18 channels. (9 horizontal and
9 "z"-related channels, and I guess this is "4h3p", not 4h3v. See
above...)
18 * 64 kbit/s = 1152 kbit/s. (2 LFE channels can be coded in few bits,
so you can easily respect the 1.2 Mbit/s limit.)
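For what it's worth, the channel counts behind a), b) and c) follow one simple rule for mixed-order Ambisonics, as I understand the "HhPp" notation: keep all periphonic components up to the vertical order p, plus two extra horizontal-only components for each order above p. A small sketch (the formula is my reading of the notation, not something from the CfP):

```python
# Channel count for a mixed-order Ambisonics set "<h>h<p>p":
# (p + 1)^2 periphonic components, plus 2 horizontal-only
# components for every order from p + 1 up to h.

def mixed_order_channels(h, p):
    assert h >= p >= 0
    return (p + 1) ** 2 + 2 * (h - p)

for name, (h, p) in [("1h1p", (1, 1)), ("3h1p", (3, 1)), ("4h3p", (4, 3))]:
    n = mixed_order_channels(h, p)
    print(name, n, "channels,", n * 64, "kbit/s at 64 kbit/s each")
```

This reproduces the three tiers above: 4 channels/256 kbit/s, 8/512, and 18/1152.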
I think that this representation fits the 22.2 loudspeaker layout quite
well. (You have the 10 speakers in the "ear-level" plane, which are
spaced in a relatively regular fashion; 10 "ground" speakers fit the
proposed minimum number for horizontal 4th order. The upper layers (we
have a 10 - 8 - 1 layout) seem to fit some 3rd-order HOA layout; 35º is
not 45º, but relatively close. I would not get too irritated by the
three low front speakers at -15º elevation, just try 4h3p and decode
well... Otherwise, go to 4h9p, if Fons will even allow this
combination... Not sure about that! :-D )
Such a mixed-order HOA "proposed codec" would represent the HOA test
material very well (CfP, p. 19: most HOA samples are 4th order).
Therefore, the Ambisonics/HOA "codec proposal" would probably win the
RM0-HOA contest at 1.2 Mbit/s (CfP, p. 7, p. 11), and maybe the HOA
tests at 512 kbit/s. (Test 1.1 HOA, Test 1.2 HOA at 512 kb/s, Test 1.3
at 512 kb/s for headphones, Test 1.4 HOA; CfP p. 7-10.)
8. There are ideas on how to code HOA channels (or 1st-order
Ambisonics) even more efficiently.
http://ro.uow.edu.au/cgi/viewcontent.cgi?article=8025&context=engpapers
For example, you might apply some light form of SBR when encoding
1st-order Ambisonics. In this case, you might be able to code 1st order
in HE-AAC at 192 kbps, or even 160 kbps. (Try to code the frequencies up
to 16 kHz in "classical" AAC fashion, and just the very high frequency
range via SBR. This might actually work quite well. The article
describes possibly more important savings in the HOA case. For example,
it might be possible to code higher orders than 3h1p in the 512 kbit/s
case...)
9. I don't believe that the "Phase 2" requirements (coding of 3D audio
at 128 kbit/s, 96 kbit/s, 64 kbit/s and 48 kbit/s) have any "big"
real-world implications. We have seen that it is very hard to code even
5.1 (ITU) surround at these bitrates. Secondly, people stream video at
much higher bitrates, so what?!
On the other hand, they write (on p. 16) about home theater with 8K * 4K
resolution as the "main scenario". (Good luck buying a TV or projector
with this resolution at a non-hurting price. In 10 years, I might
add! :-D )
You won't deliver any "immersive" 3D audio at "Phase 2" bitrates. But at
least, don't combine immersive 8K UHD TV with immersive 3D audio at
96 kbps. Please, don't do that...
10. The "proposed" Ambisonics/HOA-based 3D audio codec(s) could be
extended in many ways.
- For example, you could add 2 or 3 direct front channels to proposal
7a, and the combined bitrate for B+ (and B++ ;-) ) would be 384 kbit/s
(or 448 kbit/s). These are typical Dolby Digital bitrates, official DVD
bitrates for DD.
- You could easily extend 4h3p to 6h3p, which is 22 full channels + 2
(heavily bandwidth-limited) LFE channels. This could be coded in
< 1.5 Mbit/s, if we stay with AAC at 64 kbit/s per channel.
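The 6h3p count can be checked by hand. Assuming the same reading of the mixed-order notation as above (full periphonic set up to the vertical order, two horizontal-only components per order beyond it), a two-line sketch:

```python
# 6h3p: 3rd-order periphonic core plus horizontal-only orders 4..6.
channels = (3 + 1) ** 2 + 2 * (6 - 3)  # 16 + 6 = 22 full channels
bitrate_kbps = channels * 64           # at 64 kbit/s per channel
print(channels, bitrate_kbps)          # 22 channels, 1408 kbit/s;
                                       # 2 cheap LFEs keep it < 1.5 Mbit/s
```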
6h3p actually seems well balanced, because we know that the angular
horizontal resolution of human hearing is much higher than the angular
vertical resolution. This might be related to simple anatomy: both ears
lie at the same height, so it is harder to "measure" vertical
differences than horizontal differences.
11. This posting is motivated by the fact that 3D audio for music/games
and VR should not be limited to some half-sphere representation;
otherwise you will miss reflections, reverb and even direct sources
which exist in real-world environments/soundfields. (The last word has
been chosen very carefully. MPEG loves "MPEG Audio Technologies". I
have my own bias. 8-) )
I actually write very much from a musician's perspective. If you would
like to record music including its natural acoustics, the whole idea of
3D audio seems to be that you would not cancel parts of the acoustics
just because they don't fit into some "22.2" scheme! And if you can't
find a practical solution for the playback of sounds/reflections/reverb
from below the ear plane, there already exists one solution. (Which you
might replace with some alternative, if you find something better
within a short time-frame...)
That said, I believe MPEG should define its "3D MPS-SAOC" codec,
because this will be a valid generalization of Auro-3D/Dolby Atmos/22.2.
(ISO/IEC 23003 contains a lot of impressive stuff. I just wanted to give
some strong reasons why you should not exaggerate the "low bitrate"
requirements for any future media standard. Available bitrates keep
rising; think of the fiber and LTE data rates which are possible
today... Therefore, go for the high quality and "immersive" impression
true 3D audio < should > deliver. If not, the customer/"consumer" might
stay with 5.1 anyway. Why should he/she invest in all these additional
loudspeakers/receivers/audio cards if even the best codec got
bit-starved... :-[ And immersive might also imply full-sphere. It might
be very noticeable whether a presented acoustical scene is < real > or
"band-limited"/direction-limited.)
If the (postulated) 3D MPS-SAOC is based on a 22.2 structure, you will
stay half-sphere. We can and should do better!
Best regards
Stefan Schreiber
_______________________________________________
Sursound mailing list
[email protected]
https://mail.music.vt.edu/mailman/listinfo/sursound