Your post comes at an interesting time.
For the last few weeks, we have been fighting with Telrad engineering over
multiple issues. The vast majority of them have been on the EPC side (many of
the bugs have been crashers or otherwise extremely service-affecting!), some of
which I have detailed in past posts here. Fortunately, within the last couple
of weeks we have actually made significant progress with these and the latest
EPC code is the most stable we have seen in some time (though there are still
one or two outstanding issues; fortunately they do not seem to rear their heads
very often so by-and-large things are running smoothly).
I am still unconvinced that the dips I screenshotted a while back as recorded
by my realtime SNMP grapher are "real" and not just a function of the utility I
am using. As such I haven't spent much time chasing that particular issue
down. However, on the eNB side we have definitely run into the exact same
issue as your #2 in your list. So far, we have only noticed it on the eNBs
that are running the pre-release code, and we have only upgraded our 3 most
heavily-loaded eNBs to this release while the rest remain on 6.6 GA (4013).
(If the other eNBs were more loaded, we might feel pressure to upgrade them
past GA in order to resolve the upload performance issues that exist with the
GA release but which seem to be largely dealt with in the pre-release code.
As things stand, though, those sectors seem to be performing fine, so until
they either start seeing more traffic/UEs or I know that this particular
problem has been licked in an as-yet-unreleased eNB firmware, I will hold off
on further upgrades.)
We also seem to be fortunate in that it sounds like it is happening much less
often to us than it is to others (you). We ran pre-release eNB code for a good
2 weeks before we first encountered the issue (or, at least before we actually
noticed it). We discovered that resetting the RF interface cures it (S1 reset,
quickly toggling the spectrum analyzer on/off, etc.; a reboot takes way too
long). It didn't
happen again for a couple of days but then happened 2 days in a row on the same
eNB. During one of those times, we managed to get one of the Israeli support
guys to remote in and investigate, collect logs, etc., but then after we
thought he was done and we reset the S1 to that eNB, he said there were a few
more things he forgot to collect and to let them know when it happens again.
And of course, it hasn't! (Actually not true...it happened on a second sector,
but I knew he wasn't going to be around at that very moment and I wanted to
test and see if kicking all UEs off worked just as well as a full S1 reset, and
sure enough it did. It simply hasn't happened again since then, so there
hasn't been another opportunity for them to remote in and collect more data.)
So in our case, it can sometimes be several days between incidents. Also,
sometimes we will notice (by reviewing BreezeView KPIs) that it actually
occurred but then self-corrected with no action on our part, sometimes in as
little as 5-10 minutes (1-2 samples).
The guy I was interacting with wouldn't commit to a yes or no when I
point-blank asked him if this was "yet another bug" we were chasing (I wanted
to say "yet another @#$@#$%@ bug" but I bit my tongue ;-)). But given that it
is now clear (thanks to your post) that others are seeing the exact same thing,
I'm sure they know that they have another problem on their hands (as they CAN'T
just have heard about this from the two of us), as elusive as it might be
(impossible to reproduce at-will, etc.).
I, too, share your view of things: frustrated with the state of the product and
support infrastructure as a whole, but unwilling to pin blame on any individual
I have been in contact with. At some level, I empathize with them because I
have been in similar positions, chasing problems that are elusive and hard to
reproduce while at the same time having (legitimately) upset customers beating
you up over them. It *really* sucks, and I am *sure* that they acutely feel
the pressure to get these things fixed. At the same time, the sheer quantity
of issues we have experienced over the time we have owned this gear is somewhat
staggering, and often it seems like we trade one issue in for another, which
makes applying upgrades a scary prospect ("what new regression will we end up
fighting with this version?"). Over the last few months, we have scheduled
maintenance windows more times than I can count at the drop of a hat, often
several times in a given week, and have allowed bleeding-edge/hot off the
presses code to touch our *production* infrastructure, in essence allowing
Telrad to use our network as a guinea pig so that we can aid their engineers as
they work to reproduce these issues (since many of them have not been
reproducible in a lab). I personally have lost countless hours of sleep and
built up a tremendous sleep debt over maintaining this system, and have fallen
behind on other duties (as well as life in general) as a result. I am trying
not to sound snippy here, but it's getting to the point where I'm seriously
considering asking the question of what sort of compensation we should expect
to get in return for all of this.
At the same time, perhaps partly because I see real progress, and partly
because I think I'm largely an optimist by nature, I still hold out hope that
things are eventually going to work how they ought. I can't remember where I
heard or read this, but my impression is that the Compact's WiMAX firmware was
in a similar state for a good while when it was first introduced, but is now
regarded as rock-solid. So they are probably just in a similar stage of the
development and evolution of the LTE product. That doesn't make it any less
frustrating, though, that we seem to have been caught in this particular stage
for as long as we have.
Sadly, I won't be at WISPAmerica. If you manage to get some productive face
time with Telrad there, I'd love to hear about it afterward.
-- Nathan
From: [email protected] [mailto:[email protected]] On Behalf Of
Jeremy Austin
Sent: Thursday, March 09, 2017 1:53 PM
To: [email protected]
Subject: Re: [Telrad] Uplink throughput again
On the other hand, the dips are back for us again. This is getting to be very
wearing.
To recap:
1) We are running the prerelease code
2) We have been having to reset S1/reboot ENBs periodically (multiple times a
day on one particular sector) due to a state of stuck high RF usage
3) The dips are back (60 second cycle, significant drop in throughput for about
7 seconds)
4) Otherwise, throughput is performing *better than ever*
We are now going on 8 full months with failure to resolve these. (To be fair,
other manufacturers can take a while to fix things as well.)
Customers are complaining.
Telrad has confirmed (multiple times) that there is nothing wrong with our
network/setup/UEs. They have confirmed that we have done every single thing we
can do to verify that performance issues are *not* our problem, but Telrad's.
However, replacing Telrad is not an option at present.
I doubt this falls under any lemon laws, but I can only describe our experience
of failures as symptomatic of core issues with the Telrad business.
Individually, I take no issue with either the American or Israeli support team.
Collectively, however, we have a significant problem. There is no 24/7 NOC/TAC…
and even if there were, the fact that we probably couldn't rouse an engineer
to observe/collect data when the trouble is occurring is a serious defect.
I'm looking forward to seeing the Telrad team at WISPAmerica, but with
extremely mixed feelings about the support experience. I have attempted to burn
no bridges, but at the same time be very clear about what I perceive as a
systematic failure to deliver as promised.
I have been fairly quiet on list about our outstanding issues, thinking that
they would be better solved by superior troubleshooting and Telrad engineering
than by social engineering.
Perhaps it is time for that to change. Perhaps I am doing a disservice to other
Telrad customers by keeping quiet.
Thoughts?
On Thu, Feb 16, 2017 at 2:40 AM, Nathan Anderson <[email protected]> wrote:
Ugh, this is what I get for jumping to conclusions and running my mouth off
before doing just the slightest bit of investigation.
I think it might somehow just be the tool I'm using to do the graphing. If I
watch one of the active bandwidth tests closely while also watching the graph
of the eNB that UE is attached to, I don't (always) see the same dips.
Sooo, false alarm. Possibly. I'll keep watching things and report back.
If it's just a graphing error/anomaly, not sure what the problem would be here.
Both the tool and the switch that the eNBs are plugged into supposedly support
SNMP v2c, so we shouldn't be overrunning a 32-bit integer.
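As a rough sanity check on the counter-overrun theory, here is a minimal Python sketch of the arithmetic. It assumes the grapher polls the standard IF-MIB octet counters (32-bit ifInOctets/ifOutOctets versus the 64-bit ifHC* variants available over v2c); the function names and the 100 Mbit/s figure are illustrative only, and nothing here is specific to the actual tool or switch in use.

```python
# Illustrative helpers (hypothetical names), not part of any SNMP library.
COUNTER32_MODULUS = 2 ** 32  # a Counter32 octet counter wraps after 2^32 bytes


def seconds_until_wrap(rate_bits_per_sec, modulus=COUNTER32_MODULUS):
    """Seconds for an octet counter to wrap once at a constant bit rate."""
    bytes_per_sec = rate_bits_per_sec / 8
    return modulus / bytes_per_sec


def rate_from_samples(prev_octets, curr_octets, interval_sec,
                      modulus=COUNTER32_MODULUS):
    """Bits/s between two polls, assuming at most one wrap in the interval."""
    delta = (curr_octets - prev_octets) % modulus
    return delta * 8 / interval_sec


if __name__ == "__main__":
    # How long a 32-bit counter lasts on a saturated 100 Mbit/s test
    print(round(seconds_until_wrap(100e6), 1), "seconds until wrap")
    # A poll pair that straddles a wrap still yields the right rate
    print(rate_from_samples(4294967000, 704, 1), "bit/s")
```

At 100 Mbit/s a 32-bit octet counter wraps roughly every 5.7 minutes, so a poller that reads 32-bit counters and does not handle the wrap could draw a bogus dip or spike. If both ends really are using 64-bit counters over v2c, this should not be a factor.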
-- Nathan
From: [email protected] [mailto:[email protected]] On Behalf Of
Adam Moffett
Sent: Thursday, February 16, 2017 2:18 AM
To: [email protected]
Subject: Re: [Telrad] Uplink throughput again
Interesting.
------ Original Message ------
From: "Nathan Anderson" <[email protected]>
To: "[email protected]" <[email protected]>
Sent: 2/16/2017 4:24:00 AM
Subject: Re: [Telrad] Uplink throughput again
Jeremy mentioned his periodic traffic dips to me recently off-list. I haven't
seen anything exactly like what either of you two are talking about,
but...attached is an interesting screenshot I just took of downlink usage on 3
separate eNBs on our network, each of which I am currently saturating
(off-hours) with MT download bandwidth test (occurring behind 1 UE on each
sector, and each UE has been temporarily granted 100Mbit downlink AMBR).
Notice the little icicle-like formations? Also notice how they seem to be
fairly regular, and also seem to occur at the exact same interval on every
sector, but don't perfectly line up with each other?
WTF is *that* about?
-- Nathan
From: [email protected] [mailto:[email protected]] On Behalf Of
Jeremy Austin
Sent: Wednesday, February 15, 2017 8:44 PM
To: Adam Moffett; [email protected]
Subject: Re: [Telrad] Uplink throughput again
Adam, I'm going to assume that no other traffic on the same equipment (sans EPC
and ENB) shows this periodicity?
I have seen something in the same ballpark, but not identical, since August. I
have been planning to post it to the list to get more eyes on it (after letting
Telrad have some time to look at it first).
Just wanted to check that you had isolated the behavior entirely to LTE, and
not routers/backhauls/switches.
On Wed, Feb 15, 2017 at 7:15 PM Adam Moffett <[email protected]> wrote:
Weird. Maybe overflow from the dedicated bearer falls into the default bearer?
I also have to wonder if it's a bug in the UE. It seems like it must fall on
the UE to ultimately enforce the rate limit.
In our uplink throughput issue, I might have tripped over something of
interest. I originally reported to Telrad that I was getting about half of
what I expect for UL throughput. Now I think we actually do get the expected
throughput, but only for a moment. Five seconds later there's next to nothing,
then 5 seconds later back to full speed, and so on. I see it when looking at
the realtime traffic display on our switch port, but on your typical chart with
a 5 minute average it just looks like you're getting half speed.
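To illustrate why a 5-minute chart would hide this pattern, here is a small Python sketch. The 10 Mbit/s full rate is a made-up number, and the clean 5-seconds-on/5-seconds-off square wave is a simplification of what showed up on the switch port.

```python
def simulate_on_off(full_rate_mbps, on_sec, off_sec, total_sec):
    """Per-second throughput samples for a link alternating full speed and idle."""
    period = on_sec + off_sec
    return [full_rate_mbps if (t % period) < on_sec else 0.0
            for t in range(total_sec)]


def windowed_average(samples, window_sec):
    """Average the per-second samples over consecutive, non-overlapping windows."""
    return [sum(samples[i:i + window_sec]) / window_sec
            for i in range(0, len(samples) - window_sec + 1, window_sec)]


if __name__ == "__main__":
    # Hypothetical numbers: 10 Mbit/s when "on", 5 s on / 5 s off, 10 minutes
    samples = simulate_on_off(10.0, 5, 5, 600)
    print("realtime min/max:", min(samples), max(samples))
    print("5-minute averages:", windowed_average(samples, 300))
```

The per-second view swings between zero and full rate, while each 5-minute bucket averages out to exactly half, which matches the "half speed" impression from the long-interval chart.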
Weird thing is that it's not happening all the time. I started iPerf on 6 UE
at one site at 4am the other day and when looking at traffic at the switch port
I saw a perfect sine wave with 10 seconds peak to peak. Later that day I
repeated the test to show one of my co-workers and the damn thing wouldn't do
it.
I don't know what to make of it yet.
------ Original Message ------
From: "Nathan Anderson" <[email protected]>
To: "[email protected]" <[email protected]>; "'Adam Moffett'" <[email protected]>
Sent: 2/10/2017 3:59:40 PM
Subject: RE: [Telrad] Uplink throughput again
So last night, I re-ran this test again, and captured the whole thing not just
at the edge of the LTE network coming out of the EPC, but also between the EPC and
eNB, so that I could grab the user traffic together with the encapsulating GTP
headers.
What I found was that when traffic comes from behind the UE with the proper
DSCP value set, it DOES get transmitted by the UE on the dedicated bearer, but
the MBR is still not being enforced. I had a 10Mbit/s UL AMBR configured and a
256Kbit/s UL MBR set on the dedicated bearer, and when I ran an upload test on
the dedicated bearer, it hit 10 megs. (Download test on the dedicated bearer
was limited to the configured 256Kbit/s DL MBR.)
What makes this so bizarre is that even if there is a bug that causes the
system (which part?) to not enforce the configured rate limit for the dedicated
bearer on the uplink, the UE AMBR should not be taken into account for GBR
bearers, as discussed before. But it sure seems like what is happening is that
whatever is supposed to be policing the uplink is mistakenly enforcing the UE
UL AMBR on the dedicated bearer instead of the UL MBR.
Ticket opened with Telrad.
-- Nathan
From: [email protected] [mailto:[email protected]] On Behalf Of
Nathan Anderson
Sent: Monday, February 06, 2017 3:56 PM
To: 'Adam Moffett'; [email protected]
Subject: Re: [Telrad] Uplink throughput again
Then maybe the problem is not that the properly-marked upload traffic isn't
getting transmitted on the right bearer, but rather that the UL GBR/MBR are not
being enforced?
Whose responsibility is enforcement of bitrates on uplink? The UE's? The eNB?
The EPC? A little of columns A, B, and C?
-- Nathan
From: [email protected] [mailto:[email protected]] On Behalf Of Adam Moffett
Sent: Monday, February 06, 2017 2:50 PM
To: [email protected]
Subject: Re: [Telrad] Uplink throughput again
Somewhere there must be traffic counters for each QCI, or for individual
bearers, or something. Without seeing them it's hard to say for sure.
On a busy eNB (50+ UE), I tried changing the mgmt DSCP value on an individual
UE from 6 to 5 and testing before and after.
With the UE set to DSCP 5 for mgmt, I get 0.1 Mbps upload and 7% packet loss
(500 byte pings, 0.1 second interval)
On DSCP 6 I get 0.5 Mbps and 0% packet loss.
That's not scientific rigor, but it seems like it's working.
On a lighter loaded eNB I was actually getting slightly more UL throughput with
the UE Mgmt DSCP set to 5. I don't know why.
-Adam
------ Original Message ------
From: "Nathan Anderson" <[email protected]>
To: "[email protected]" <[email protected]>; "'Adam Moffett'" <[email protected]>
Sent: 2/6/2017 5:11:49 PM
Subject: RE: [Telrad] Uplink throughput again
...also, I still remain unconvinced that the UEs are transmitting any upload
traffic -- even when properly marked with the right DSCP -- on the dedicated
bearer. Until it is proven beyond a doubt that this works, testing upload
capacity using dedicated bearers is probably a waste of time because it isn't
doing what you think it is doing.
I have tested both CPE7000 and CPE8000 at this point, and have the same issue
on both, so I don't think it is a CPE firmware bug (that would be a freaky
coincidence, given that both CPEs are contract-manufactured by different
companies). So I don't know if this is me being stupid and not configuring my
EPCs correctly, or what. But something is not working here.
-- Nathan
From: [email protected] [mailto:[email protected]] On Behalf Of
Nathan Anderson
Sent: Monday, February 06, 2017 2:06 PM
To: 'Adam Moffett'; [email protected]
Subject: Re: [Telrad] Uplink throughput again
Something that I learned that I should point out:
A dedicated bearer with a higher priority should take precedence over default
bearer traffic, yes. But from what I can tell, the LTE spec does not have a way
of putting a total speed cap on the entire UE across any and all bearers. The
UE AMBRs only restrict non-GBR bearers (default or not, even across
multiple APNs) and do NOT take into account GBR bearers, and QCI 1 is GBR.
What this means is that, for example, if you have a default bearer with QCI 6,
and dedicated bearer with QCI 1, and the UE DL and UL AMBRs are set to 10 and 1
Mbit/s respectively, and your dedicated bearer's MBRs are set to 5 and 0.5
(half of the UE AMBRs, for the sake of this example), you haven't actually set
up things such that up to half of the subscriber's AMBRs are given priority on
the dedicated bearer, leaving that user half of his total bandwidth if you end
up filling the dedicated bearer up to its MBR in both directions. No, instead
because the GBR QCIs are not accounted for within the AMBR, the user can move
up to 5x0.5 Mbit/s (DL x UL) on the dedicated bearer and *simultaneously* also
move up to 10x1 (assuming there is enough sector capacity at the time) on the
default bearer.
Maybe in some cases, this is desirable. If you use QCI 1 for VoIP, for
example, then you are effectively providing the customer with a separate
channel for their voice calls that does not dip into their configured speed
package, but is instead additive. But it is something to keep in mind as you
are planning and building your network as well as running tests.
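The arithmetic in that example can be sketched in a few lines of Python. The Bearer class and function name are hypothetical, and the GBR set below covers only the originally standardized GBR QCIs (1 through 4). This is a back-of-the-envelope model of the caps as described above, not of how any particular EPC or eNB enforces them.

```python
from dataclasses import dataclass

# Assumption: only the originally standardized GBR QCIs are listed here.
GBR_QCIS = {1, 2, 3, 4}


@dataclass
class Bearer:
    """Hypothetical bearer description; MBRs only matter for GBR bearers."""
    qci: int
    mbr_dl_mbps: float = 0.0
    mbr_ul_mbps: float = 0.0


def max_ue_throughput(ambr_dl_mbps, ambr_ul_mbps, bearers):
    """Upper bound on (DL, UL) Mbit/s a UE can move across all of its bearers.

    Non-GBR bearers share the UE AMBR; each GBR bearer adds its own MBR on top.
    """
    gbr = [b for b in bearers if b.qci in GBR_QCIS]
    has_non_gbr = any(b.qci not in GBR_QCIS for b in bearers)
    dl = (ambr_dl_mbps if has_non_gbr else 0.0) + sum(b.mbr_dl_mbps for b in gbr)
    ul = (ambr_ul_mbps if has_non_gbr else 0.0) + sum(b.mbr_ul_mbps for b in gbr)
    return dl, ul


if __name__ == "__main__":
    # The example above: default QCI 6, dedicated QCI 1 at 5x0.5, AMBR 10x1
    bearers = [Bearer(qci=6), Bearer(qci=1, mbr_dl_mbps=5.0, mbr_ul_mbps=0.5)]
    print(max_ue_throughput(10.0, 1.0, bearers))
```

With those numbers the ceiling works out to 15 down and 1.5 up rather than 10x1, which is the additive behavior to plan around.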
-- Nathan
From: [email protected] [mailto:[email protected]] On Behalf Of Adam Moffett
Sent: Monday, February 06, 2017 1:48 PM
To: [email protected]
Subject: Re: [Telrad] Uplink throughput again
The EPC and most of the eNB are running the latest general release available on
Zendesk.
A couple of eNB are running some kind of maintenance release that support
wanted us to try.
I'm making sure to run iPerf on the dedicated bearer to eliminate other user
traffic from weaker UE as a factor. At QCI 1 it should take precedence over
the default bearer traffic.
I would definitely take the time to set one up, not necessarily for this
purpose, but rather to ensure you always have access to your UE. If the
default bearer is hosed with a torrent and you don't have a dedicated bearer
for management access then you can be completely locked out of the unit.
Monitoring, management access, and firmware updates all work more reliably with
the dedicated bearer and I'd strongly recommend it. There's a knowledge base
article in Zendesk about it. Use DSCP 6 because that's tagged by default in
the UE.
------ Original Message ------
From: "Jeremy Austin" <[email protected]>
To: "Adam Moffett" <[email protected]>; [email protected]
Sent: 2/6/2017 4:30:43 PM
Subject: Re: [Telrad] Uplink throughput again
On Mon, Feb 6, 2017 at 12:20 PM, Adam Moffett <[email protected]> wrote:
Can somebody tell me if they're getting expected uplink throughput?
What ENB and EPC revisions are you at, Adam?
We're investigating this same issue ourselves, although we haven't tried a
dedicated bearer.
--
Jeremy Austin
(907) 895-2311
(907) 803-5422
[email protected]
Heritage NetWorks
Whitestone Power & Communications
Vertical Broadband, LLC
Schedule a meeting: http://doodle.com/jermudgeon
_______________________________________________
Telrad mailing list
[email protected]
http://lists.wispa.org/mailman/listinfo/telrad