Just an update to this: at the direction of Telrad support, I ran 2 
simultaneous packet captures during a download where corruption occurred: one 
right at the point of ingress at the EPC, and one right at the point of egress.

It turns out that I was WRONG about part of this.  The EPC is definitely 
corrupting traffic in the newer firmwares we have been given, as the captures 
demonstrate, but it is NOT also regenerating the TCP checksum on every packet 
that flows through it, thank goodness.  No, it turns out that the reason these 
corrupt payloads make it all the way to the user is that the CPE7000's NAT 
engine completely recomputes the checksum on each packet, instead of properly 
adjusting it to reflect only the changes that NAT itself makes to the headers 
(see section 3.3 of https://www.ietf.org/rfc/rfc1631.txt).  So this is a 
two-parter: the EPC is corrupting bits, and the CPE7000 is responsible for 
covering up the corruption.
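
To illustrate the difference between the two NAT behaviors, here is a minimal 
sketch in Python.  It is NOT the CPE7000's or CPE8000's actual code, and it is 
simplified: the "segment" is just a 4-byte address followed by a payload, 
rather than a real TCP segment with pseudo-header.  The addresses, payload, 
and function names are all made up for the example.  The incremental update 
uses the formula from RFC 1624 (HC' = ~(~HC + ~m + m')).

```python
def fold(total: int) -> int:
    """Fold carries back in until the sum fits in 16 bits."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def ones_sum(data: bytes) -> int:
    """16-bit one's-complement sum over big-endian words."""
    if len(data) % 2:
        data += b"\x00"
    return fold(sum((data[i] << 8) | data[i + 1]
                    for i in range(0, len(data), 2)))

def checksum(data: bytes) -> int:
    return ~ones_sum(data) & 0xFFFF

def verifies(data: bytes, csum: int) -> bool:
    """A receiver accepts the segment if data + checksum sums to 0xFFFF."""
    return fold(ones_sum(data) + csum) == 0xFFFF

def incremental_update(csum: int, old: bytes, new: bytes) -> int:
    """Adjust csum for a field changing from old to new (RFC 1624 eq. 3).

    Only the rewritten words are touched, so damage elsewhere in the
    segment is NOT folded into the new checksum.
    """
    for i in range(0, len(old), 2):
        m = (old[i] << 8) | old[i + 1]
        m_new = (new[i] << 8) | new[i + 1]
        csum = ~fold((~csum & 0xFFFF) + (~m & 0xFFFF) + m_new) & 0xFFFF
    return csum

# Hypothetical values, for illustration only.
old_addr = bytes([10, 0, 0, 5])        # private address before NAT
new_addr = bytes([203, 0, 113, 7])     # public address after NAT
payload = b"GET /index.html HTTP/1.1"
sender_csum = checksum(old_addr + payload)

# Corrupt one 4-byte cluster in transit (bytes 1-2 zeroed, 3-4 changed).
corrupt = bytearray(payload)
corrupt[4:8] = b"\x00\x00\x12\x34"
corrupt = bytes(corrupt)

# Correct NAT: adjust the sender's checksum for the address change only.
nat_csum = incremental_update(sender_csum, old_addr, new_addr)
clean_ok = verifies(new_addr + payload, nat_csum)
incremental_ok = verifies(new_addr + corrupt, nat_csum)

# Broken NAT: recompute from scratch over whatever bytes arrived.
recomputed_ok = verifies(new_addr + corrupt, checksum(new_addr + corrupt))

print(clean_ok, incremental_ok, recomputed_ok)
```

With the incremental update, an undamaged segment still validates after the 
address rewrite, while the damaged one fails at the receiver and gets 
retransmitted.  With the full recompute, the damaged segment validates, and 
the corruption is hidden from the client.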

I tested with a CPE8000, and its NAT engine is doing the right thing.  Thus, 
the corrupt packets reach the client with a checksum that no longer matches, 
and the client tosses them, triggering a retransmit.

The EPC firmware we have been using is a development build, and the corruption 
bug appears to be unique to that build.  But the CPE7000 firmware we used for 
testing was the latest public release (116).

-- Nathan

From: telrad-boun...@wispa.org [mailto:telrad-boun...@wispa.org] On Behalf Of 
Nathan Anderson
Sent: Wednesday, March 15, 2017 1:47 AM
To: telrad@wispa.org
Subject: Re: [Telrad] Uplink throughput again

This is exactly it.  We didn't have the visibility into things to see what was 
causing the poor throughput at first (yet another one of our longstanding 
frustrations with the platform), but this is the problem that Jeremy and I were 
referring to.

I'm glad to say that we have not (knowingly) experienced the CPU usage 
fluctuations on our EPCs.

As for the data corruption issue, you likely will not have run up against it 
unless you are running a preproduction release of 6.7.  The symptom is that we 
see clusters of 4 consecutive bytes with various bits flipped (usually bytes 1 
and 2 are zeroed out, and bytes 3 and 4 are completely different from what they 
should be, but the exact pattern of what gets changed is not clear to us yet).  
We see on average between 12 and 60 corrupted bytes per 100MB transferred per 
user in this state.  The VERY BAD and VERY SCARY part is that if you do a 
packet capture, you will see that exactly zero TCP packets have a checksum that 
does not validate.  So it's not a case where data is getting corrupted, most of 
the bad packets are thrown out because the checksum doesn't match, and only a 
handful slip through.  No, every single packet has a valid checksum, even the 
ones with corrupt data in them.  What this means is that 1) HTTPS transfers 
just stop and die when the corruption occurs, and 2) HTTP/FTP/other unencrypted 
transfers introduce silent data corruption into the download that you won't 
discover until it is too late.
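
For anyone who wants to check their own downloads for this, one way is to 
compare a file pulled through the EPC against a known-good copy and list the 
runs of bytes that differ.  The following is only a rough sketch in Python; 
the function name and the max_gap parameter are my own invention, not part of 
any Telrad tool.  max_gap lets nearby differences be merged into one cluster, 
since a "zeroed" byte that was already zero in the original will not show up 
as a difference.

```python
def diff_clusters(good: bytes, bad: bytes, max_gap: int = 0):
    """Return (offset, length) runs where bad differs from good.

    Differences separated by at most max_gap matching bytes are merged
    into a single run.  Only the common prefix length is compared.
    """
    clusters = []
    for off, (g, b) in enumerate(zip(good, bad)):
        if g == b:
            continue
        if clusters:
            start, length = clusters[-1]
            if off - (start + length) <= max_gap:
                # Extend the previous run to include this offset.
                clusters[-1] = (start, off - start + 1)
                continue
        clusters.append((off, 1))
    return clusters

# Made-up data: a 32-byte "file" with one damaged 4-byte cluster.
good = bytes(range(32, 64))
bad = bytearray(good)
bad[8:12] = b"\x00\x00\xaa\x55"
print(diff_clusters(good, bytes(bad)))
```

For real downloads you would read both files from disk (in chunks, for large 
files) and pass the bytes in; the report of offsets and lengths makes it easy 
to see whether the damage comes in 4-byte clusters as described above.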

That all packets have a checksum that validates would seem to suggest that the 
EPC is ingesting TCP packets from the PDN interface, throwing out the original 
TCP checksum (as a shortcut, or...? what valid reasons would you possibly have 
for doing this?), doing something internally that causes random corruption, and 
then recomputing a new checksum from scratch before sending it on to the target 
user over S1-U.  That a bug like this is even *possible* BLOWS MY MIND.  If 
you're going to ignore the original checksum that the packet arrives with, 
what's the point of the checksum in the first place?  How can I ever trust the 
data flowing through this device again knowing that it is working around and 
subverting a key component that helps to ensure and preserve data integrity?

-- Nathan

From: telrad-boun...@wispa.org [mailto:telrad-boun...@wispa.org] On Behalf Of 
Adam Moffett
Sent: Tuesday, March 14, 2017 8:34 PM
To: telrad@wispa.org; telrad@wispa.org
Subject: Re: [Telrad] Uplink throughput again

* UE getting stuck at MCS4....apparently until an S1 reset.  This may or may 
not be the same throughput issue that you guys were talking about earlier in 
the thread.
_______________________________________________
Telrad mailing list
Telrad@wispa.org
http://lists.wispa.org/mailman/listinfo/telrad
