Hello GSM community, I realize that most of you over in Osmocom land would much rather see me submit Gerrit patches than write lengthy ML posts, but right now I really need some help with the algorithmic logic of a feature before I can develop patches implementing said feature - so please bear with me.
The fundamental question is: what is the most correct way for a GSM network (let's ignore divisions between network elements for the moment) to construct the DL speech frame stream for call leg B when it comes from the UL of call leg A? I am talking about call scenarios where call leg A and call leg B use the same codec, thus no transcoding is done (TrFO), and let me further restrict this question to the old-style FR/HR/EFR codecs, as opposed to AMR.

At first the answer may seem so obvious that many people will probably wonder why I am asking such a silly question: just take the speech frame stream from call leg A UL, feed it to call leg B DL and be done with it, right? But the question is not so simple. What should the UL-to-DL mapper do when the UL stream hits a BFI instead of a valid speech frame? What should this mapper do if call leg A does DTXu but there is no DTXd on call leg B?

The only place in the 3GPP specs where I could find an answer to this question is TS 28.062 section C.3.2.1.1. Yes, I know that it's the spec for in-band TFO within G.711, a feature which I reckon no one other than me cares about, but that particular section - I am talking about section C.3.2.1.1 specifically, you can ignore the rest of TFO for the purpose of this question - seems to me like it should apply to _any_ scenario where an FR/HR/EFR frame stream is passed directly from call leg A to call leg B without transcoding, including scenarios like a self-contained Osmocom network with OsmoMSC switching from one MS to another without any external MNCC.

Let us first consider the case of the FR1 codec, which is the simplest. Suppose call leg A has DTXu but call leg B has no DTXd - one can't do DTXd on C0, so if 200 kHz of spectrum is all you've got and you operate a BTS with just C0, then you can't do DTXd at all.
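To make the question concrete, here is a rough sketch of the decision the UL-to-DL mapper has to make in every 20 ms window. All names here are my own hypothetical illustration, not any existing Osmocom API, and the in_dtx_pause flag stands in for state the mapper would have to track itself (are we between SIDs, or mid-talkspurt?):

```c
/* What arrived from call leg A UL in this 20 ms window */
enum ul_class { UL_SPEECH, UL_SID, UL_BFI };

/* What the mapper emits toward call leg B DL */
enum dl_action {
	DL_FORWARD,	/* pass the speech frame through unchanged */
	DL_SID,		/* forward/refresh SID - only valid if leg B has DTXd */
	DL_GEN_CN,	/* synthesize a valid *speech* frame carrying comfort
			   noise, per my reading of TS 28.062 C.3.2.1.1 */
	DL_ECU,		/* error concealment (substitute/mute) for a BFI
			   that falls inside a speech talkspurt */
};

static enum dl_action map_ul_to_dl(enum ul_class ul, int leg_b_dtxd,
				   int in_dtx_pause)
{
	switch (ul) {
	case UL_SPEECH:
		return DL_FORWARD;
	case UL_SID:
		return leg_b_dtxd ? DL_SID : DL_GEN_CN;
	case UL_BFI:
		if (in_dtx_pause)	/* BFI placemarker between SIDs */
			return leg_b_dtxd ? DL_SID : DL_GEN_CN;
		return DL_ECU;		/* genuine radio loss mid-speech */
	}
	return DL_ECU;	/* not reached */
}
```

The DL_GEN_CN branch is the one that current Osmocom has no code for, and the one the rest of this post is about.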
When Alice on call leg A is silent, her MS will send a SID every 480 ms and keep its Tx off the rest of the time, and the frame stream from the BTS serving her call leg will exhibit a SID frame in every 24th position and BFI placemarkers in all other positions. So what should the DL frame stream going to Bob look like in this scenario?

My reading of section C.3.2.1.1 (the second paragraph from the top is the one that covers this scenario) tells me that the *network* (set aside the question of which element) is supposed to turn that stream of BFIs with occasional interspersed SIDs into a stream of valid *speech* frames going to Bob - a stream of valid speech frames representing comfort noise, as produced by a network-located CN generator. The spec says in that paragraph: "The Downlink TRAU Frames shall not contain the SID codeword, but parameters that allow a direct decoding."

Needless to say, there is no code anywhere in Osmocom currently that does the above, thus current Osmocom is not able to produce the fancy TrFO behavior which the spec(s) seem to call for. (I say "spec(s)" vaguely because I only found a spec for TFO, not for TrFO, but I don't see any reason why this aspect of the TFO spec shouldn't also apply to TrFO when the actual problem at hand is exactly the same.) But no no no guys, I am *not* bashing Osmocom here, I am seeking to improve it! As it happens, fully implementing the complete set of TS 28.062 section C.3.2.1.1 rules (I shall hereafter call them C3211 rules for short) for the original FR1 codec would be quite easy, and I already have a code implementation which I am eyeing to integrate into Osmocom.
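For FR1 this is tractable because GSM 06.12 defines comfort noise entirely at the codec-parameter level: keep the spectral envelope (LARs) and level (Xmax) conveyed by the SID, and randomize the RPE excitation. A very rough sketch of what "speech frames representing comfort noise" could look like, operating on an unpacked parameter view of the 260-bit frame (struct and field names are mine for illustration, not libgsmfrp's actual API; LTP handling and the spec's exact pseudo-random procedure are omitted):

```c
#include <stdint.h>

/* Unpacked view of an FR1 frame's codec parameters (simplified) */
struct fr1_params {
	uint8_t lar[8];		/* log area ratios, once per frame */
	uint8_t xmax[4];	/* block amplitude, one per subframe */
	uint8_t grid[4];	/* RPE grid position, 2 bits each */
	uint8_t pulses[4][13];	/* RPE pulse magnitudes, 3 bits each */
	/* LTP lag/gain omitted for brevity */
};

/* Tiny LCG as a stand-in for the spec's pseudo-random source */
static uint32_t cn_rand(uint32_t *state)
{
	*state = *state * 1103515245u + 12345u;
	return *state >> 16;
}

/* Build a valid *speech* frame that decodes to comfort noise:
 * spectral envelope and level come from the SID, excitation is random. */
static void fr1_make_cn_frame(const struct fr1_params *sid,
			      struct fr1_params *out, uint32_t *rng)
{
	int sf, i;

	for (i = 0; i < 8; i++)
		out->lar[i] = sid->lar[i];
	for (sf = 0; sf < 4; sf++) {
		out->xmax[sf] = sid->xmax[sf];
		out->grid[sf] = cn_rand(rng) & 3;	/* 2-bit field */
		for (i = 0; i < 13; i++)
			out->pulses[sf][i] = cn_rand(rng) & 7; /* 3-bit */
	}
}
```

The point of the sketch is only that for FR1 the CN generator never has to touch the decoder internals - it works purely on the quantized parameter fields, which is exactly why the FR1 case is easy and the EFR case (below) is not.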
Themyscira libgsmfrp is a FLOSS library that implements a complete, spec-compliant Rx DTX handler for FR1, and it is 100% my own original work, not based on ETSI or TI or any other sources, thus no silly license issues - and I am eyeing the idea of integrating the same functions, appropriately renamed, repackaged and re-API-ed, into libosmocodec, and then invoking that functionality in OsmoBTS, in the code path that goes from RTP Rx to feeding TCH DL to the PHY layers.

But while FR1 is easy, doing the same for EFR is where the real difficulty lies, and this is the part where I come to the community for help. The key difference between FR1 and EFR that matters here is how their respective Rx DTX handlers are defined in the specs. For FR1 the Rx DTX handler is a separate piece: the interface from this Rx DTX handler to the main body of the decoder is another 260-bit FR1 frame (this time without the possibility of SID or BFI), and the specs for DTX (06.31 plus 06.11 and 06.12) define and describe the needed Rx DTX handler in terms of emitting that secondary 260-bit FR1 frame. Thus implementing this functionality in Themyscira libgsmfrp was a simple matter of taking the logic described in the specs and turning it into code.

But for EFR the specs do not define the Rx DTX handler as a separate piece; instead it is integrated into the guts of the full decoder. There is a decoder, presented as published C source from ETSI, that takes a 244-bit EFR frame, which can be either speech or SID, *plus* a BFI flag as input, and emits a block of 160 PCM samples as output - all Rx DTX logic is buried inside, intertwined with the actual speech decoder operation, which is naturally quite complex.
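For contrast, the clean separation of the FR1 handler means its interface and state machine can be sketched in a few lines - frames or BFI placemarkers in, valid speech frames out, never SID, never BFI. This is a heavily simplified illustration with hypothetical names (libgsmfrp's real API differs, the hangover-period bookkeeping is omitted, and the CN and ECU branches are placeholders - the spec's ECU is repeat-then-mute, not indefinite repetition):

```c
#include <stdint.h>
#include <string.h>

#define FR1_FRAME_BYTES 33	/* RFC 3551 packing of the 260-bit frame */

enum rx_state { RX_SPEECH, RX_COMFORT_NOISE };

struct fr1_rx_dtx_state {
	enum rx_state state;
	uint8_t last_sid[FR1_FRAME_BYTES];	/* most recent valid SID */
	uint8_t last_good[FR1_FRAME_BYTES];	/* for ECU substitution */
};

static void emit_cn(struct fr1_rx_dtx_state *st, uint8_t *out)
{
	/* placeholder: a real handler synthesizes a comfort noise *speech*
	 * frame from the stored SID parameters per GSM 06.12 */
	memcpy(out, st->last_sid, FR1_FRAME_BYTES);
}

/* One 20 ms tick: frame_in may be NULL when bfi is set */
static void fr1_rx_dtx_handle(struct fr1_rx_dtx_state *st,
			      const uint8_t *frame_in, int bfi, int sid,
			      uint8_t *speech_out)
{
	if (!bfi && !sid) {		/* good speech frame */
		st->state = RX_SPEECH;
		memcpy(st->last_good, frame_in, FR1_FRAME_BYTES);
		memcpy(speech_out, frame_in, FR1_FRAME_BYTES);
	} else if (!bfi && sid) {	/* valid SID: enter/refresh CN */
		st->state = RX_COMFORT_NOISE;
		memcpy(st->last_sid, frame_in, FR1_FRAME_BYTES);
		emit_cn(st, speech_out);
	} else if (st->state == RX_COMFORT_NOISE) {
		emit_cn(st, speech_out);	/* DTX pause continues */
	} else {
		/* grossly simplified ECU: repeat last good frame */
		memcpy(speech_out, st->last_good, FR1_FRAME_BYTES);
	}
}
```

For EFR no such sketch is possible at the frame level, because the specs never define what the post-DTX-handler frame stream would look like - that is precisely the problem.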
I've already spent a lot of time looking at the reference C implementation of EFR from ETSI - I kinda had to, as I did the rather substantial work of turning it into a usable function library, with state structures and a well-defined interface instead of global vars and namespace pollution; the result is Themyscira libgsmefr. But I am still no closer to being able to implement C3211 functionality for this codec.

The problem is this: starting with an EFR SID frame and a previous history of a few speech frames (the hangover period), how would one produce output EFR speech frames (not SID) that represent comfort noise, as C3211 says is required? We can all easily look at ETSI's original code that generates CN as part of the standard decoder - but that code generates linear PCM output, not secondary EFR speech frames that represent CN. There is the main body of the speech decoder, and there are conditions throughout that modify this decoder logic in subtle ways for CN generation and/or for ECU-style substitution/muting - but there is no guidance for how one could construct "valid speech" EFR frames that would produce a similar result when fed to the standard decoder in the MS after crossing radio leg B.

This is where I could really use some input from more senior and more knowledgeable GSM-ers: does anyone know how mainstream commercial GSM infra vendors (particularly "ancient" ones of the pure T1/E1 TDM kind) have solved this problem? What do _they_ do in the scenario of call leg A with DTXu feeding into call leg B without DTXd?
Given that those specs were written in the happy and glorious days when everyone used 2G, when GSM operators had lots of spectrum, and when most networks operated large multi-ARFCN BTSes with frequency hopping, I figure that almost everyone probably ran with DTXd enabled when that spec section was written - hence I wonder whether the authors of the TFO spec failed to appreciate the magnitude of what they were asking implementors to do when they stipulated that a UL-to-DL mapping from DTXu-on to DTXd-off "shall" emit no-SID speech frames that represent TFO-TRAU-generated CN. And I wonder whether the actual implementors ignored that stipulation even Back In The Day...

Here is one way we might be able to "cheat": what if we implement a sort of fake DTXd in OsmoBTS for times when real DTXd is not possible because we only have C0? Here is what I mean: suppose the stream of TCH frames about to be sent to the PHY layer (perhaps the output of my proposed, to-be-implemented UL-to-DL mapper) is the kind that would be intended for a DTXd-enabled DL in the original GSM architecture, with all speech pauses filled with repeated SIDs, every 20 ms without fail. A traditional DTXd BTS is supposed to transmit only those SIDs that either immediately follow a speech frame or fall in the SACCH-aligned always-Tx position, and turn the Tx off at other times.

We can't actually turn off Tx at those "other" times when we are C0 - but what if we create a "fake DTXd" effect by transmitting a dummy FACCH containing an L2 fill frame at exactly the times when we would do real DTXd if we could? The end effect would be that the spec-based Rx DTX handler in the MS "sees" the same thing as with real DTXd: receiving FACCH in all those "empty" 20 ms frame windows will cause that spec-based Rx DTX handler to get BFI=1, exactly the same as if radio Tx were truly off and the MS were listening to radio noise.
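The fake-DTXd idea boils down to a per-20-ms scheduling decision, something like the following sketch. The taf_pos parameter hand-waves the SACCH alignment arithmetic: I am assuming the always-Tx SID position recurs once per 24 speech frames (480 ms), with the actual alignment computed elsewhere from the frame number:

```c
enum dl_tx_kind {
	TX_SPEECH,	/* normal speech frame, always transmitted */
	TX_SID,		/* SID in a position where real DTXd would key Tx */
	TX_FACCH_FILL,	/* dummy FACCH with an L2 fill frame, standing in
			   for the Tx-off periods that real DTXd would use;
			   the MS sees BFI=1 in this 20 ms window */
};

/* fn20: running 20 ms frame counter for this channel.
 * is_speech / prev_was_speech: classification of the current and
 * previous frames in the (fully SID-filled) DL stream. */
static enum dl_tx_kind fake_dtxd_decide(int is_speech, int prev_was_speech,
					unsigned fn20, unsigned taf_pos)
{
	if (is_speech)
		return TX_SPEECH;
	/* current frame is a SID: real DTXd transmits it only if it
	 * immediately follows speech or sits in the aligned position */
	if (prev_was_speech || (fn20 % 24) == taf_pos)
		return TX_SID;
	/* everywhere else real DTXd would switch Tx off; on C0 we
	 * substitute the dummy FACCH instead */
	return TX_FACCH_FILL;
}
```

If this decision function is fed the mapper output described earlier in this post, the MS-side Rx DTX handler should observe the same SID/BFI pattern as under genuine DTXd - that is the whole trick.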
Anyway, I would love to hear other people's thoughts on these ideas, especially if someone happens to know how traditional GSM infra vendors handled those pesky requirements of TS 28.062 section C.3.2.1.1 for UL-to-DL mapping.

Sincerely,
Your GSM-obsessed Mother Mychaela

_______________________________________________
Community mailing list
Community@freecalypso.org
https://www.freecalypso.org/mailman/listinfo/community