Re: [Telrad] Uplink throughput again

Nathan Anderson Thu, 09 Mar 2017 14:48:07 -0800

Your post comes at an interesting time.

For the last few weeks, we have been fighting with Telrad engineering over 
multiple issues.  The vast majority of them have been on the EPC side (many of 
the bugs have been crashers or otherwise extremely service-affecting!), some of 
which I have detailed in past posts here.  Fortunately, within the last couple 
of weeks we have actually made significant progress with these and the latest 
EPC code is the most stable we have seen in some time (though there are still 
one or two outstanding issues; fortunately they do not seem to rear their heads 
very often so by-and-large things are running smoothly).


I am still unconvinced that the dips I screenshotted a while back as recorded 
by my realtime SNMP grapher are "real" and not just a function of the utility I 
am using.  As such I haven't spent much time chasing that particular issue 
down.  However, on the eNB side we have definitely run into the exact same 
issue as your #2 in your list.  So far, we have only noticed it on the eNBs 
that are running the pre-release code, and we have only upgraded our 3 most 
heavily-loaded eNBs to this release while the rest remain on 6.6 GA (4013).  
(If the other eNBs were more loaded, we might feel pressure to upgrade them 
past GA in order to resolve the upload performance issues that exist with the 
GA release but which seem to be largely dealt with in the pre-release code.  
As things stand, though, those sectors seem to be performing fine, so until 
they either start seeing more traffic/UEs or I know that this particular 
problem has been licked in an as-yet-unreleased eNB firmware, I will hold off 
on further upgrades.)

We also seem to be fortunate in that it sounds like it is happening much less 
often to us than it is to others (you).  We ran pre-release eNB code for a good 
2 weeks before we first encountered the issue (or at least before we actually 
noticed it).  We discovered that resetting the RF interface cures it (an S1 
reset, quickly toggling the spectrum analyzer on and off, etc.; a reboot takes 
way too long).  It didn't 
happen again for a couple of days but then happened 2 days in a row on the same 
eNB.  During one of those times, we managed to get one of the Israeli support 
guys to remote in and investigate, collect logs, etc., but then after we 
thought he was done and we reset the S1 to that eNB, he said there were a few 
more things he forgot to collect and to let them know when it happens again.

And of course, it hasn't!  (Actually not true...it happened on a second sector, 
but I knew he wasn't going to be around at that very moment and I wanted to 
test and see if kicking all UEs off worked just as well as a full S1 reset, and 
sure enough it did.  It simply hasn't happened again since then, so there 
hasn't been another opportunity for them to remote in and collect more data.)  
So in our case, it can sometimes be several days between incidents.  Also, 
sometimes we will notice (by reviewing BreezeView KPIs) that it actually 
occurred but then self-corrected with no action on our part, sometimes in as 
little as 5-10 minutes (1-2 samples).

The guy I was interacting with wouldn't commit to a yes or no when I 
point-blank asked him if this was "yet another bug" we were chasing (I wanted 
to say "yet another @#$@#$%@ bug" but I bit my tongue ;-)).  But given that it 
is now clear (thanks to your post) that others are seeing the exact same thing, 
I'm sure they know that they have another problem on their hands (as they CAN'T 
just have heard about this from the two of us), as elusive as it might be 
(impossible to reproduce at-will, etc.).

I, too, share your view of things: frustrated with the state of the product and 
support infrastructure as a whole, but unwilling to pin blame on any individual 
I have been in contact with.  At some level, I empathize with them because I 
have been in similar positions, chasing problems that are elusive and hard to 
reproduce while at the same time having (legitimately) upset customers beating 
you up over them.  It *really* sucks, and I am *sure* that they acutely feel 
the pressure to get these things fixed.  At the same time, the sheer quantity 
of issues we have experienced over the time we have owned this gear is somewhat 
staggering, and often it seems like we trade one issue in for another, which 
makes applying upgrades a scary prospect ("what new regression will we end up 
fighting with this version?").  Over the last few months, we have scheduled 
maintenance windows more times than I can count at the drop of a hat, often 
several times in a given week, and have allowed bleeding-edge/hot off the 
presses code to touch our *production* infrastructure, in essence allowing 
Telrad to use our network as a guinea pig so that we can aid their engineers as 
they work to reproduce these issues (since many of them have not been 
reproducible in a lab).  I personally have lost countless hours of sleep and 
built up a tremendous sleep debt over maintaining this system, and have fallen 
behind on other duties (as well as life in general) as a result.  I am trying 
not to sound snippy here, but it's getting to the point where I'm seriously 
considering asking the question of what sort of compensation we should expect 
to get in return for all of this.

At the same time, perhaps partly because I see real progress, and partly 
because I think I'm largely an optimist by nature, I still hold out hope that 
things are eventually going to work how they ought.  I can't remember where I 
heard or read this, but my impression is that the Compact's WiMAX firmware was 
in a similar state for a good while when it was first introduced, but is now 
regarded as rock-solid.  So they are probably just in a similar stage of the 
development and evolution of the LTE product.  That doesn't make it any less 
frustrating, though, that we seem to have been caught in this particular stage 
for as long as we have.

Sadly, I won't be at WISPAmerica.  If you manage to get some productive face 
time with Telrad there, I'd love to hear about it afterward.

-- Nathan

From: [email protected] [mailto:[email protected]] On Behalf Of 
Jeremy Austin
Sent: Thursday, March 09, 2017 1:53 PM
To: [email protected]
Subject: Re: [Telrad] Uplink throughput again

On the other hand, the dips are back for us again. This is getting to be very 
wearing.

To recap:
1) We are running the prerelease code
2) We have been having to reset S1/reboot ENBs periodically (multiple times a 
day on one particular sector) due to a state of stuck high RF usage
3) The dips are back (60 second cycle, significant drop in throughput for about 
7 seconds)
4) Otherwise, throughput is performing *better than ever*

We are now going on 8 full months with failure to resolve these. (To be fair, 
other manufacturers can take a while to fix things as well.)

Customers are complaining.

Telrad has confirmed (multiple times) that there is nothing wrong with our 
network/setup/UEs. They have confirmed that we have done every single thing we 
can do to verify that performance issues are *not* our problem, but Telrad's.

However, replacing Telrad is not an option at present.

I doubt this falls under any lemon laws, but I can only describe our experience 
of failures as symptomatic of core issues with the Telrad business. 
Individually, I take no issue with either the American or Israeli support team.

Collectively, however, we have a significant problem. There is no 24/7 NOC/TAC… 
and even if there were, the fact that we probably couldn't rouse an engineer 
to observe/collect data when the trouble is occurring is a serious defect.

I'm looking forward to seeing the Telrad team at WISPAmerica, but with 
extremely mixed feelings about the support experience. I have attempted to burn 
no bridges, but at the same time be very clear about what I perceive as a 
systemic failure to deliver as promised.

I have been fairly quiet on list about our outstanding issues, thinking that 
they would be better solved by superior troubleshooting and Telrad engineering 
than by social engineering.

Perhaps it is time for that to change. Perhaps I am doing a disservice to other 
Telrad customers by keeping quiet.

Thoughts?

On Thu, Feb 16, 2017 at 2:40 AM, Nathan Anderson 
<[email protected]> wrote:
Ugh, this is what I get for jumping to conclusions and running my mouth off 
before doing just the slightest bit of investigation.

I think it might somehow just be the tool I'm using to do the graphing.  If I 
watch one of the active bandwidth tests closely while also watching the graph 
of the eNB that UE is attached to, I don't (always) see the same dips.

Sooo, false alarm.  Possibly.  I'll keep watching things and report back.

If it's just a graphing error/anomaly, not sure what the problem would be here. 
 Both the tool and the switch that the eNBs are plugged into supposedly support 
SNMP v2c, so we shouldn't be overrunning a 32-bit integer.
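
For reference, the counter-wrap concern can be sketched roughly as follows. This is only an illustration, not how any particular graphing tool works: the function names are hypothetical, and it assumes the poller reads an octet counter (a 32-bit one such as ifInOctets, or a 64-bit one such as ifHCInOctets, which SNMP v2c makes available) at a fixed interval, with at most one wrap between polls.

```python
# Hypothetical helpers (not taken from any real grapher): convert two SNMP
# octet-counter samples into a bit rate, tolerating a single counter wrap.

def rate_bps(prev_octets, curr_octets, interval_s, counter_bits=64):
    """Bits per second between two samples of a wrapping octet counter."""
    modulus = 1 << counter_bits
    # Modular subtraction yields the right delta if the counter wrapped once.
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / interval_s


def seconds_to_wrap(link_bps, counter_bits=32):
    """How long a counter of this width lasts at a sustained bit rate."""
    return (1 << counter_bits) * 8 / link_bps


# A 32-bit octet counter at a sustained 100 Mbit/s wraps in under 6 minutes.
print(seconds_to_wrap(100e6))

# A sample pair that straddles a 32-bit wrap still gives a sane rate.
print(rate_bps(4294967000, 704, 10, counter_bits=32))
```

If the poller really is using 64-bit counters, a wrap is effectively impossible at these speeds, so regular dips would have to come from somewhere else (polling jitter, for instance).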

-- Nathan

From: [email protected] [mailto:[email protected]] On Behalf Of 
Adam Moffett
Sent: Thursday, February 16, 2017 2:18 AM

To: [email protected]
Subject: Re: [Telrad] Uplink throughput again

Interesting.

------ Original Message ------
From: "Nathan Anderson" <[email protected]>
To: "[email protected]" <[email protected]>
Sent: 2/16/2017 4:24:00 AM
Subject: Re: [Telrad] Uplink throughput again

Jeremy mentioned his periodic traffic dips to me recently off-list.  I haven't 
seen anything exactly like what either of you two are talking about, 
but...attached is an interesting screenshot I just took of downlink usage on 3 
separate eNBs on our network, each of which I am currently saturating 
(off-hours) with MT download bandwidth test (occurring behind 1 UE on each 
sector, and each UE has been temporarily granted 100Mbit downlink AMBR).

Notice the little icicle-like formations?  Also notice how they seem to be 
fairly regular, and also seem to occur at the exact same interval on every 
sector, but don't perfectly line up with each other?

WTF is *that* about?

-- Nathan

From: [email protected] [mailto:[email protected]] On Behalf Of 
Jeremy Austin
Sent: Wednesday, February 15, 2017 8:44 PM
To: Adam Moffett; [email protected]
Subject: Re: [Telrad] Uplink throughput again

Adam, I'm going to assume that no other traffic on the same equipment (sans EPC 
and ENB) shows this periodicity?

I have seen something in the same ballpark, but not identical, since August. I 
have been planning to post it to the list to get more eyes on it (after letting 
Telrad have some time to look at it first).

Just wanted to check that you had isolated the behavior entirely to LTE, and 
not routers/backhauls/switches.


On Wed, Feb 15, 2017 at 7:15 PM Adam Moffett 
<[email protected]> wrote:
Weird.  Maybe overflow from the dedicated bearer falls into the default bearer? 
 I also have to wonder if it's a bug in the UE.  It seems like it must fall on 
the UE to ultimately enforce the rate limit.

In our uplink throughput issue, I might have tripped over something of 
interest.  I originally reported to Telrad that I was getting about half of 
what I expect for UL throughput.  Now I think we actually do get the expected 
throughput, but only for a moment.  Five seconds later there's next to nothing, 
then 5 seconds later back to full speed, and so on.  I see it when looking at 
the realtime traffic display on our switch port, but on your typical chart with 
a 5 minute average it just looks like you're getting half speed.
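
To illustrate why the chart hides this, here is a rough sketch. It assumes an idealized on/off pattern (5 seconds at full rate, 5 seconds at zero) and an arbitrary 10 Mbit/s peak; neither number is a measurement, and the function name is made up for the example.

```python
# Idealized model: throughput alternates between full rate and nothing.
# The peak rate and the exact on/off shape are assumptions for illustration.

def square_wave(peak_mbps, on_s, off_s, total_s):
    """One sample per second: peak for on_s seconds, then 0 for off_s seconds."""
    period = on_s + off_s
    return [peak_mbps if (t % period) < on_s else 0.0 for t in range(total_s)]


samples = square_wave(peak_mbps=10.0, on_s=5, off_s=5, total_s=300)

# What a realtime display would show at its best moment ...
peak_seen = max(samples)
# ... versus the single point a 5-minute-average chart would plot.
five_min_avg = sum(samples) / len(samples)

print(peak_seen, five_min_avg)
```

The per-second view reaches the full rate, while the 5-minute point lands at half of it, which matches "about half of what I expect" on a typical chart.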

Weird thing is that it's not happening all the time.  I started iPerf on 6 UE 
at one site at 4am the other day and when looking at traffic at the switch port 
I saw a perfect sine wave with 10 seconds peak to peak.  Later that day I 
repeated the test to show one of my co-workers and the damn thing wouldn't do 
it.

I don't know what to make of it yet.


------ Original Message ------
From: "Nathan Anderson" <[email protected]>
To: "[email protected]" <[email protected]>; "'Adam Moffett'" <[email protected]>
Sent: 2/10/2017 3:59:40 PM
Subject: RE: [Telrad] Uplink throughput again

So last night, I re-ran this test again, and captured the whole thing not just 
at the edge of the LTE network coming out of the EPC, but between the EPC and 
eNB, so that I could grab the user traffic together with the encapsulating GTP 
headers.

What I found was that when traffic comes from behind the UE with the proper 
DSCP value set, it DOES get transmitted by the UE on the dedicated bearer, but 
the MBR is still not being enforced.  I had a 10Mbit/s UL AMBR configured and a 
256Kbit/s UL MBR set on the dedicated bearer, and when I ran an upload test on 
the dedicated bearer, it hit 10 megs.  (Download test on the dedicated bearer 
was limited to the configured 256Kbit/s DL MBR.)

What makes this so bizarre is that even if there is a bug that causes the 
system (which part?) to not enforce the configured rate limit for the dedicated 
bearer on the uplink, the UE AMBR should not be taken into account for GBR 
bearers, as discussed before.  But it sure seems like what is happening is that 
whatever is supposed to be policing the uplink is mistakenly enforcing the UE 
UL AMBR on the dedicated bearer instead of the UL MBR.

Ticket opened with Telrad.

-- Nathan

From: [email protected] [mailto:[email protected]] On Behalf Of 
Nathan Anderson
Sent: Monday, February 06, 2017 3:56 PM

To: 'Adam Moffett'; [email protected]
Subject: Re: [Telrad] Uplink throughput again

Then maybe the problem is not that the properly-marked upload traffic isn't 
getting transmitted on the right bearer, but rather that the UL GBR/MBR are not 
being enforced?

Whose responsibility is enforcement of bitrates on uplink?  The UE's?  The eNB? 
 The EPC?  A little of columns A, B, and C?

-- Nathan

From: [email protected] [mailto:[email protected]] On Behalf Of Adam Moffett
Sent: Monday, February 06, 2017 2:50 PM

To: [email protected]
Subject: Re: [Telrad] Uplink throughput again

Somewhere there must be traffic counters for each QCI, or for individual 
bearers, or something.  Without seeing them it's hard to say for sure.

On a busy eNB (50+ UE), I tried changing the mgmt DSCP value on an individual 
UE from 6 to 5 and testing before and after.

With the UE set to DSCP 5 for mgmt, I get 0.1 Mbps upload and 7% packet loss 
(500 byte pings, 0.1 second interval)
On DSCP 6 I get 0.5 Mbps and 0% packet loss.

That's not scientific rigor, but it seems like it's working.

On a lighter loaded eNB I was actually getting slightly more UL throughput with 
the UE Mgmt DSCP set to 5.  I don't know why.

-Adam



------ Original Message ------
From: "Nathan Anderson" <[email protected]>
To: "[email protected]" <[email protected]>; "'Adam Moffett'" <[email protected]>
Sent: 2/6/2017 5:11:49 PM
Subject: RE: [Telrad] Uplink throughput again

...also, I still remain unconvinced that the UEs are transmitting any upload 
traffic -- even when properly marked with the right DSCP -- on the dedicated 
bearer.  Until it is proven beyond a doubt that this works, testing upload 
capacity using dedicated bearers is probably a waste of time because it isn't 
doing what you think it is doing.

I have tested both CPE7000 and CPE8000 at this point, and have the same issue 
on both, so I don't think it is a CPE firmware bug (that would be a freaky 
coincidence, given that both CPEs are contract-manufactured by different 
companies).  So I don't know if this is me being stupid and not configuring my 
EPCs correctly, or what.  But something is not working here.

-- Nathan

From: [email protected] [mailto:[email protected]] On Behalf Of 
Nathan Anderson
Sent: Monday, February 06, 2017 2:06 PM
To: 'Adam Moffett'; [email protected]
Subject: Re: [Telrad] Uplink throughput again

Something that I learned that I should point out:

A dedicated bearer with a higher priority should take precedence over default 
bearer traffic, yes.  But from what I can tell, the LTE spec does not have a way 
of putting a total speed cap on the entire UE across any and all bearers.  The 
UE AMBRs only restrict non-GBR bearers (default or not, even across 
multiple APNs) and do NOT take GBR bearers into account, and QCI 1 is GBR.

What this means is that, for example, if you have a default bearer with QCI 6, 
and dedicated bearer with QCI 1, and the UE DL and UL AMBRs are set to 10 and 1 
Mbit/s respectively, and your dedicated bearer's MBRs are set to 5 and 0.5 
(half of the UE AMBRs, for the sake of this example), you haven't actually set 
up things such that up to half of the subscriber's AMBRs are given priority on 
the dedicated bearer, leaving that user half of his total bandwidth if you end 
up filling the dedicated bearer up to its MBR in both directions.  No, instead 
because the GBR QCIs are not accounted for within the AMBR, the user can move 
up to 5x0.5 on the dedicated bearer and *simultaneously* also move up to 10x1 
(assuming there is enough sector capacity at the time) on the default bearer.
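
The arithmetic in that example can be sketched like this. It is a simplified model under stated assumptions, not anything from the EPC itself: the `Bearer` class and `ue_ceiling` function are hypothetical, APN-AMBR is ignored, and sector capacity is assumed to be unlimited.

```python
from dataclasses import dataclass


@dataclass
class Bearer:
    qci: int
    is_gbr: bool
    mbr_dl: float = 0.0  # Mbit/s; only meaningful for GBR bearers
    mbr_ul: float = 0.0  # Mbit/s; only meaningful for GBR bearers


def ue_ceiling(ambr_dl, ambr_ul, bearers):
    """Upper bound on a UE's total (DL, UL) rate in this simplified model.

    All non-GBR bearers share the UE AMBR; each GBR bearer is capped by its
    own MBR and is NOT counted against the AMBR, so the two simply add up.
    """
    gbr_dl = sum(b.mbr_dl for b in bearers if b.is_gbr)
    gbr_ul = sum(b.mbr_ul for b in bearers if b.is_gbr)
    has_non_gbr = any(not b.is_gbr for b in bearers)
    dl = gbr_dl + (ambr_dl if has_non_gbr else 0.0)
    ul = gbr_ul + (ambr_ul if has_non_gbr else 0.0)
    return dl, ul


bearers = [
    Bearer(qci=6, is_gbr=False),                         # default bearer
    Bearer(qci=1, is_gbr=True, mbr_dl=5.0, mbr_ul=0.5),  # dedicated bearer
]
print(ue_ceiling(10.0, 1.0, bearers))
```

With the numbers above, the ceiling comes out to 15 down and 1.5 up rather than 10 and 1.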

Maybe in some cases, this is desirable.  If you use QCI 1 for VoIP, for 
example, then you are effectively providing the customer with a separate 
channel for their voice calls that does not dip into their configured speed 
package, but is instead additive.  But it is something to keep in mind as you 
are planning and building your network as well as running tests.

-- Nathan

From: [email protected] [mailto:[email protected]] On Behalf Of Adam Moffett
Sent: Monday, February 06, 2017 1:48 PM
To: [email protected]
Subject: Re: [Telrad] Uplink throughput again

The EPC and most of the eNB are running the latest general release available on 
Zendesk.
A couple of eNB are running some kind of maintenance release that support 
wanted us to try.

I'm making sure to run iPerf on the dedicated bearer to eliminate other user 
traffic from weaker UE as a factor.  At QCI 1 it should take precedence over 
the default bearer traffic.

I would definitely take the time to set one up, not necessarily for this 
purpose, but rather to ensure you always have access to your UE.  If the 
default bearer is hosed with a torrent and you don't have a dedicated bearer 
for management access then you can be completely locked out of the unit.  
Monitoring, management access, and firmware updates all work more reliably with 
the dedicated bearer and I'd strongly recommend it.  There's a knowledge base 
article in Zendesk about it.  Use DSCP 6 because that's tagged by default in 
the UE.



------ Original Message ------
From: "Jeremy Austin" <[email protected]>
To: "Adam Moffett" <[email protected]>; [email protected]
Sent: 2/6/2017 4:30:43 PM
Subject: Re: [Telrad] Uplink throughput again


On Mon, Feb 6, 2017 at 12:20 PM, Adam Moffett 
<[email protected]> wrote:
Can somebody tell me if they're getting expected uplink throughput?


What ENB and EPC revisions are you at, Adam?

We're investigating this same issue ourselves, although we haven't tried a 
dedicated bearer.


--
Jeremy Austin

(907) 895-2311
(907) 803-5422
[email protected]

Heritage NetWorks
Whitestone Power & Communications
Vertical Broadband, LLC

Schedule a meeting: http://doodle.com/jermudgeon
_______________________________________________
Telrad mailing list
[email protected]
http://lists.wispa.org/mailman/listinfo/telrad




