Hey,

This might be of interest...

Before, every time I got a GSO superpacket from the kernel, I'd split it into little packets, and then queue each little packet as a separate parallel job.

Now, every time I get a GSO superpacket from the kernel, I split it into little packets, and queue up that whole bundle of packets as a single parallel job.

This means that each GSO superpacket expansion gets processed on a single CPU. This greatly simplifies the algorithm, and delivers mega impressive throughput gains.

In practice, what this means is that if you call send(tcp_socket_fd, buffer, biglength), then each 65k contiguous chunk of buffer will be encrypted on the same CPU. Before, each 1.5k contiguous chunk would be encrypted on the same CPU.

I had thought about doing this a long time ago, but didn't, for reasons that are now fuzzy to me. I believe it had something to do with latency. But at the moment, I think this solution will actually reduce latency on systems with lots of cores, since it means those cores don't all have to be synchronized before a bundle can be sent out. I haven't measured this yet, and I welcome any such tests.

The magic commit for this is [1], if you'd like to compare before and after.

Are there any obvious objections I've overlooked with this simplified approach?

Thanks,
Jason

[1] https://git.zx2c4.com/WireGuard/commit/?id=7901251422e55bcd55ab04afb7fb390983593e39
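
For anyone who wants to picture the two queuing strategies described above, here is a rough userspace sketch. It is not the WireGuard code: the names (segment, fake_encrypt, send_per_packet, send_per_bundle), the 1500-byte segment size, and the XOR stand-in for encryption are all assumptions made up for illustration, and a thread pool only loosely models per-CPU workers.

```python
from concurrent.futures import ThreadPoolExecutor

SEGMENT_SIZE = 1500  # assumed per-packet size, for illustration only


def segment(superpacket, size=SEGMENT_SIZE):
    """Split one superpacket into consecutive packets of at most `size` bytes."""
    return [superpacket[i:i + size] for i in range(0, len(superpacket), size)]


def fake_encrypt(packet):
    """Placeholder transform standing in for real encryption (hypothetical)."""
    return bytes(b ^ 0x5A for b in packet)


def encrypt_bundle(packets):
    """Encrypt every packet of one bundle sequentially, inside a single job."""
    return [fake_encrypt(p) for p in packets]


def send_per_packet(pool, superpacket):
    # Old approach: one parallel job per little packet. All jobs must
    # finish before the bundle can go out in order.
    futures = [pool.submit(fake_encrypt, p) for p in segment(superpacket)]
    return [f.result() for f in futures]


def send_per_bundle(pool, superpacket):
    # New approach: the whole bundle is one job, so a single worker
    # handles every packet derived from this superpacket.
    return pool.submit(encrypt_bundle, segment(superpacket)).result()
```

Both functions return the same packets in the same order; the difference is only in how many jobs are queued and how many workers have to be waited on before the bundle is complete.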
