Nathan, thanks for the clarification.

On Thu, Mar 16, 2017 at 8:54 AM Nathan Anderson <[email protected]> wrote:
> Just an update to this: at the direction of Telrad support, I ran 2
> simultaneous packet captures during a download where corruption occurred:
> one right at the point of ingress at the EPC, and one right at the point of
> egress.
>
> It turns out that I was WRONG about part of this. The EPC is definitely
> corrupting traffic in the newer firmwares we have been given, as the
> captures demonstrate, but it is NOT also regenerating the TCP payload
> checksums on every packet that flows through it, thank goodness. No, it
> turns out that the reason these payloads are making it all the way to the
> user is because the CPE7000's NAT engine is the one completely recomputing
> the checksums, instead of properly modifying them to only reflect the
> changes that it makes to the headers (see https://www.ietf.org/rfc/rfc1631.txt
> section 3.3). So this is a two-parter: the EPC is corrupting bits, and
> the CPE7000 is responsible for covering up the corruption.
>
> I tested with a CPE8000, and its NAT engine is doing the right thing.
> Thus, the corrupt packets make it to the client, which sees the invalid
> checksum, and which tosses the packet, triggering retransmit.
>
> The EPC firmware we have been using is a development build, and the
> corruption bug appears to be unique to that. But the CPE7000 firmware we
> used for testing was the latest public release (116).
>
> -- Nathan
>
> *From:* [email protected] [mailto:[email protected]] *On
> Behalf Of *Nathan Anderson
> *Sent:* Wednesday, March 15, 2017 1:47 AM
> *To:* [email protected]
> *Subject:* Re: [Telrad] Uplink throughput again
>
> This is exactly it. We didn't have the visibility into things to see what
> was causing the poor throughput at first (yet another one of our
> longstanding frustrations with the platform), but this is the problem that
> Jeremy and I were referring to.
>
> I'm glad to say that we have not (knowingly) experienced the CPU usage
> fluctuations on our EPCs.
>
> As far as the data corruption one, you likely will not have run up against
> it unless you are running a preproduction release of 6.7. The symptoms are
> that we will see clusters of 4 consecutive bytes that have various bits
> flipped (usually what happens is that bytes 1 and 2 are zeroed out, and
> bytes 3 and 4 are completely different than what they would normally be,
> but the pattern of what exactly is changed is not clear to us yet). We see
> on average between 12 and 60 bytes per 100MB transferred per user in this
> state. The VERY BAD and VERY SCARY part is that if you do a packet
> capture, you will see that exactly zero TCP packets have a checksum that
> does not validate. So it's not like data is getting corrupted, and a lot
> of packets are being thrown out because the checksum doesn't compute/match,
> but a small percentage or handful get through. No, every single packet has
> a valid checksum, even the ones with corrupt data in them. What this means
> is that 1) HTTPS transfers just stop and die when the corruption occurs,
> and 2) HTTP/FTP/other unencrypted transfers introduce silent data
> corruption into the download that you won't discover until it is too late.
>
> That all packets have a checksum that validates would seem to suggest that
> the EPC is ingesting TCP packets from the PDN interface, throwing out the
> original TCP checksum (as a shortcut, or...? what valid reasons would you
> possibly have for doing this?), doing something internally that causes
> random corruption, and then recomputing a new checksum from scratch before
> sending it onto the target user over S1-U. That a bug like this is even
> *possible* BLOWS MY MIND. If you're going to ignore the original checksum
> that the packet arrives with, what's the point of the checksum in the first
> place?
> How can I ever trust the data flowing through this device again
> knowing that it is working around and subverting a key component that helps
> to ensure and preserve data integrity?
>
> -- Nathan
>
> *From:* [email protected] [mailto:[email protected]
> <[email protected]>] *On Behalf Of *Adam Moffett
> *Sent:* Tuesday, March 14, 2017 8:34 PM
> *To:* [email protected]; [email protected]
> *Subject:* Re: [Telrad] Uplink throughput again
>
> * UE getting stuck at MCS4....apparently until an S1 reset. This may or
> may not be the same throughput issue that you guys were talking about
> earlier in the thread.
> _______________________________________________
> Telrad mailing list
> [email protected]
> http://lists.wispa.org/mailman/listinfo/telrad
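
For anyone following along, here is a rough sketch of why the two NAT behaviors described above give such different results. It is a toy model, not Telrad's code: the "segment" is just one 16-bit header word (standing in for the port that NAT rewrites) plus a payload, with no TCP pseudo-header, and every function and variable name is made up for illustration. The incremental update uses the formula from RFC 1624 (HC' = ~(~HC + ~m + m')), which is the usual way to adjust a checksum for a changed header field without touching the payload.

```python
import struct


def ones_sum(data: bytes) -> int:
    """16-bit one's-complement sum of data (zero-padded to even length)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def checksum(data: bytes) -> int:
    """Internet checksum: complement of the one's-complement sum."""
    return ~ones_sum(data) & 0xFFFF


def incremental_update(old_csum: int, old_word: int, new_word: int) -> int:
    """Adjust a checksum for one changed 16-bit word (RFC 1624, eqn. 3)."""
    total = (~old_csum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def is_valid(data: bytes, csum: int) -> bool:
    """A receiver accepts the data if sum(data) + checksum == 0xFFFF."""
    total = ones_sum(data) + csum
    total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF


# Hypothetical values: the sender's segment and its original checksum.
old_port, new_port = 0x1F90, 0xC350
payload = b"hello, world"
original = struct.pack("!H", old_port) + payload
csum = checksum(original)

# An upstream box zeroes two payload bytes but leaves the checksum alone.
corrupted_payload = b"\x00\x00" + payload[2:]

# The NAT then rewrites the header word on the already-corrupted segment.
translated = struct.pack("!H", new_port) + corrupted_payload

recomputed = checksum(translated)                       # full recompute
updated = incremental_update(csum, old_port, new_port)  # header-only adjust

print(is_valid(translated, recomputed))  # True: corruption is masked
print(is_valid(translated, updated))     # False: receiver drops the segment
```

The full recompute validates because it is calculated over whatever bytes happen to be in the payload at that moment, while the incremental update only accounts for the header change, so the earlier payload damage still shows up as a checksum mismatch at the client.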
_______________________________________________
Telrad mailing list
[email protected]
http://lists.wispa.org/mailman/listinfo/telrad
