Nathan, thanks for the clarification.

On Thu, Mar 16, 2017 at 8:54 AM Nathan Anderson <[email protected]> wrote:

> Just an update to this: at the direction of Telrad support, I ran 2
> simultaneous packet captures during a download where corruption occurred:
> one right at the point of ingress at the EPC, and one right at the point of
> egress.
>
>
>
> It turns out that I was WRONG about part of this.  The EPC is definitely
> corrupting traffic in the newer firmware builds we have been given, as the
> captures demonstrate, but it is NOT also regenerating the TCP payload
> checksums on every packet that flows through it, thank goodness.  No, it
> turns out that the reason these payloads are making it all the way to the
> user is that the CPE7000's NAT engine is completely recomputing the
> checksums, instead of properly modifying them to reflect only the
> changes that it makes to the headers (see https://www.ietf.org/rfc/rfc1631.txt
> section 3.3).  So this is a two-parter: the EPC is corrupting bits, and
> the CPE7000 is responsible for covering up the corruption.
>
>
>
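
To make the difference between the two NAT behaviours concrete, here is a minimal Python sketch. It is not Telrad's code. It assumes a toy "packet" made of a 4-byte address followed by the payload, rather than a real TCP segment with its pseudo-header, and the addresses, payload bytes, and function names are all made up for illustration. The incremental update follows RFC 1624 (eqn. 3), which is one common way to do what RFC 1631 section 3.3 asks for.

```python
def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum (RFC 1071), zero-padded to even length."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                      # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return total


def full_checksum(data: bytes) -> int:
    """Checksum computed from scratch over everything in `data`."""
    return ~ones_complement_sum(data) & 0xFFFF


def incremental_update(old_csum: int, old_word: int, new_word: int) -> int:
    """RFC 1624 eqn. 3: HC' = ~(~HC + ~m + m') for one changed 16-bit word."""
    total = (~old_csum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def nat_incremental(csum: int, old_addr: bytes, new_addr: bytes) -> int:
    """Adjust the checksum only for the address words the NAT rewrote."""
    for i in (0, 2):
        old_w = (old_addr[i] << 8) | old_addr[i + 1]
        new_w = (new_addr[i] << 8) | new_addr[i + 1]
        csum = incremental_update(csum, old_w, new_w)
    return csum


# Hypothetical values, for illustration only.
public_addr = bytes([203, 0, 113, 7])       # address before NAT
private_addr = bytes([10, 0, 0, 5])         # address after NAT
payload = b"ABCDEFGHIJKL"
# Mimic the reported pattern: two bytes zeroed, the next two changed.
corrupt = payload[:4] + b"\x00\x00\x12\x34" + payload[8:]

sender_csum = full_checksum(public_addr + payload)

# NAT that adjusts the sender's checksum vs. NAT that recomputes it.
csum_incremental = nat_incremental(sender_csum, public_addr, private_addr)
csum_recomputed = full_checksum(private_addr + corrupt)

# What the client computes over the bytes it actually received.
client_csum = full_checksum(private_addr + corrupt)

incremental_detects = csum_incremental != client_csum
recompute_detects = csum_recomputed != client_csum
print("incremental NAT exposes corruption:", incremental_detects)
print("recomputing NAT exposes corruption:", recompute_detects)
```

With the incremental update, the checksum still describes what the sender transmitted, so the flipped bytes show up as a mismatch at the client and the packet is dropped. A from-scratch recompute is taken over the already-corrupted bytes, so it validates and the damage goes unnoticed.
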
> I tested with a CPE8000, and its NAT engine is doing the right thing.
> Thus, the corrupt packets make it to the client, which sees the invalid
> checksum and tosses the packet, triggering a retransmit.
>
>
>
> The EPC firmware we have been using is a development build, and the
> corruption bug appears to be unique to that.  But the CPE7000 firmware we
> used for testing was the latest public release (116).
>
>
>
> -- Nathan
>
>
>
> *From:* [email protected] [mailto:[email protected]] *On
> Behalf Of *Nathan Anderson
> *Sent:* Wednesday, March 15, 2017 1:47 AM
> *To:* [email protected]
>
>
> *Subject:* Re: [Telrad] Uplink throughput again
>
>
>
> This is exactly it.  We didn't have the visibility into things to see what
> was causing the poor throughput at first (yet another one of our
> longstanding frustrations with the platform), but this is the problem that
> Jeremy and I were referring to.
>
>
>
> I'm glad to say that we have not (knowingly) experienced the CPU usage
> fluctuations on our EPCs.
>
>
>
> As for the data corruption issue, you likely will not have run up against
> it unless you are running a preproduction release of 6.7.  The symptoms are
> that we will see clusters of 4 consecutive bytes that have various bits
> flipped (usually what happens is that bytes 1 and 2 are zeroed out, and
> bytes 3 and 4 are completely different than what they would normally be,
> but the pattern of what exactly is changed is not clear to us yet).  We see
> on average between 12 and 60 bytes per 100MB transferred per user in this
> state.  The VERY BAD and VERY SCARY part is that if you do a packet
> capture, you will see that exactly zero TCP packets have a checksum that
> does not validate.  So it's not like data is getting corrupted, and a lot
> of packets are being thrown out because the checksum doesn't compute/match,
> but a small percentage or handful get through.  No, every single packet has
> a valid checksum, even the ones with corrupt data in them.  What this means
> is that 1) HTTPS transfers just stop and die when the corruption occurs,
> and 2) HTTP/FTP/other unencrypted transfers introduce silent data
> corruption into the download that you won't discover until it is too late.
>
>
>
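
For anyone who wants to check their own downloads for this, one rough way is to compare a known-good copy of a file against the downloaded copy and list the runs of differing bytes. The sketch below is only an illustration: the function name and the `max_gap` parameter are my own, and the sample data is synthetic rather than taken from a real capture.

```python
def diff_clusters(good: bytes, bad: bytes, max_gap: int = 2):
    """Return (start, end) offsets of runs where `bad` differs from `good`.

    Differing bytes separated by fewer than `max_gap` matching bytes are
    merged into one run, since a corrupted byte can coincidentally equal
    the original value. `end` is exclusive.
    """
    clusters = []
    for i in range(min(len(good), len(bad))):
        if good[i] == bad[i]:
            continue
        if clusters and i - clusters[-1][1] < max_gap:
            clusters[-1][1] = i + 1         # extend the current run
        else:
            clusters.append([i, i + 1])     # start a new run
    return [(start, end) for start, end in clusters]


# Synthetic example: two 4-byte clusters with the first two bytes zeroed.
good = bytes(range(1, 65))
bad = bytearray(good)
bad[16:20] = b"\x00\x00\xAA\xBB"
bad[40:44] = b"\x00\x00\x01\x02"

for start, end in diff_clusters(good, bytes(bad)):
    print(f"offset {start}: {good[start:end].hex()} -> {bytes(bad[start:end]).hex()}")
```

A plain Python loop like this will be slow on a 100MB file, so for real use it would make sense to read both files in chunks and only walk byte by byte through chunks that differ.
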
> That all packets have a checksum that validates would seem to suggest that
> the EPC is ingesting TCP packets from the PDN interface, throwing out the
> original TCP checksum (as a shortcut, or...? what valid reasons would you
> possibly have for doing this?), doing something internally that causes
> random corruption, and then recomputing a new checksum from scratch before
> sending it onto the target user over S1-U.  That a bug like this is even
> *possible* BLOWS MY MIND.  If you're going to ignore the original checksum
> that the packet arrives with, what's the point of the checksum in the first
> place?  How can I ever trust the data flowing through this device again
> knowing that it is working around and subverting a key component that helps
> to ensure and preserve data integrity?
>
>
>
> -- Nathan
>
>
>
> *From:* [email protected] [mailto:[email protected]] *On
> Behalf Of *Adam Moffett
> *Sent:* Tuesday, March 14, 2017 8:34 PM
> *To:* [email protected]; [email protected]
> *Subject:* Re: [Telrad] Uplink throughput again
>
>
>
> * UE getting stuck at MCS4... apparently until an S1 reset.  This may or
> may not be the same throughput issue that you guys were talking about
> earlier in the thread.
> _______________________________________________
> Telrad mailing list
> [email protected]
> http://lists.wispa.org/mailman/listinfo/telrad
>