Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
On Thu, Jun 29, 2017 at 04:43:28PM -0700, Tom Herbert wrote:
> On Thu, Jun 29, 2017 at 1:58 PM, Willy Tarreau wrote:
> > On Thu, Jun 29, 2017 at 01:40:26PM -0700, Tom Herbert wrote:
> >> > In fact that's not quite what I observe in the field. In practice, large
> >> > data streams are cheaply relayed using splice(); I could achieve
> >> > 60 Gbps of HTTP forwarding via HAProxy on a 4-core Xeon 2 years ago.
> >> > And when you use SSL, the cost of the copy to/from kernel is small
> >> > compared to all the crypto operations surrounding this.
> >> >
> >> Right, getting rid of the extra crypto operations and so called "SSL
> >> inspection" is the ultimate goal this is going towards.
> >
> > Yep but in order to take decisions at L7 you need to decapsulate SSL.
> >
> Decapsulate or decrypt? There's a big difference... :-) I am aiming
> to just have to decapsulate.

Sorry, but what difference do you make? For me "decapsulate" means
"extract the next level layer", and for SSL it means you need to decrypt.

> >
> >> Performance is relevant because we
> >> potentially want security applied to every message in every
> >> communication in a containerized data center. Putting the userspace
> >> hop in the datapath of every packet is known to be problematic, not
> >> just for the performance hit but also because it increases the attack
> >> surface on users' privacy.
> >
> > While I totally agree on the performance hit when inspecting each packet,
> > I fail to see the relation with users' privacy. In fact under some
> > circumstances it can even be the opposite. For example, using something
> > like kTLS for a TCP/HTTP proxy can result in cleartext being observable
> > in strace while it's not visible when TLS is terminated in userland because
> > all you see are openssl's read()/write() operations. Maybe you have specific
> > attacks in mind?
> >
> No, just the normal problem of making yet one more tool systematically
> have access to user data.

OK.

> >> > Regarding kernel-side protocol parsing, there's an unfortunate trend
> >> > of moving more and more protocols to userland due to these protocols
> >> > evolving very quickly. At least you'll want to find a way to provide
> >> > these parsers from userspace, which will inevitably come with its set
> >> > of problems or limitations :-/
> >> >
> >> That's why everything is going BPF now ;-)
> >
> > Yes, I knew you were going to suggest this :-) I'm still prudent on it
> > to be honest. I don't think it would be that easy to implement an HPACK
> > encoder/decoder using BPF. And even regarding just plain HTTP parsing,
> > certain very small operations in haproxy's parser can quickly result in
> > a 10% performance degradation when improperly optimized (e.g. changing a
> > "likely", altering branch prediction, or cache walk patterns when using
> > arrays to evaluate character classes faster). But for general usage I
> > indeed think it should be OK.
> >
> HTTP might qualify as a special case, and I believe there's already
> been some work to put http in kernel by Alexander Krizhanovsky and
> others. In this case maybe http parse could be front end before BPF.

It could indeed be an option. We've seen this with Tux in the past.

> Although, pretty clear we'll need regex in BPF if we want to use it with
> http.

I think so as well. And some loop-like operations (foreach or stuff
like this so that they remain bounded).

Willy
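
To make the last two points concrete (arrays used to evaluate character
classes, and loops that stay bounded), here is a minimal userspace sketch
in C. It is neither haproxy's parser nor BPF code. The names tchar_table,
init_tchar_table(), token_len() and the MAX_SCAN bound are hypothetical,
and the character set assumed is the HTTP "token" set from RFC 7230.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SCAN 64  /* hypothetical fixed upper bound on the scan loop */

static unsigned char tchar_table[256];  /* 1 if the byte is an HTTP token char */

/* Fill the lookup table once; classifying a byte is then a single array load. */
static void init_tchar_table(void)
{
    const char *set = "!#$%&'*+-.^_`|~"
                      "0123456789"
                      "abcdefghijklmnopqrstuvwxyz"
                      "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    memset(tchar_table, 0, sizeof(tchar_table));
    for (; *set; set++)
        tchar_table[(unsigned char)*set] = 1;
}

/* Length of the leading token in buf, looking at no more than MAX_SCAN bytes. */
static size_t token_len(const unsigned char *buf, size_t len)
{
    size_t i, limit = len < MAX_SCAN ? len : MAX_SCAN;

    for (i = 0; i < limit; i++)
        if (!tchar_table[buf[i]])
            break;
    return i;
}
```

The loop count never exceeds MAX_SCAN regardless of the input, which is
the kind of property a bounded "foreach" would have to guarantee.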
Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
On 29 June 2017 at 16:21, Tom Herbert wrote:
> I think the main part of that discussion was around stream parser
> which is needed for message delineation. For a 1:1 proxy, KCM is
> probably overkill (the whole KCM data path and lock become
> superfluous). Also, there's no concept of creating a whole message
> before routing it, in the 1:1 case we should let the message pass
> through once it's cleared by the filter (this is the strparser change
> I referred to). As I mentioned, for L7 load balancing we would want a
> multiplexor probably also M:N, but the structure is different since
> there's still no user facing sockets, they're all TCP for instance.
> IMO, the 1:1 proxy case is compelling to solve in itself...

I see. I was definitely thinking m:n. We should definitely evaluate
whether it makes sense to have a specific 1:1 implementation if we
need m:n anyway. For L7 LB, m:n seems obvious as a particular L4
connection may act as a transport for multiple requests
bidirectionally. KCM looks like a good starting point for that.

When I talked about enqueueing entire messages, the main concern is to
buffer up the payload after the TLS handshake to the point where a
forwarding decision can be made. I would definitely not advocate
buffering entire messages before starting to forward.
Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
On Thu, Jun 29, 2017 at 1:58 PM, Willy Tarreau wrote:
> On Thu, Jun 29, 2017 at 01:40:26PM -0700, Tom Herbert wrote:
>> > In fact that's not quite what I observe in the field. In practice, large
>> > data streams are cheaply relayed using splice(); I could achieve
>> > 60 Gbps of HTTP forwarding via HAProxy on a 4-core Xeon 2 years ago.
>> > And when you use SSL, the cost of the copy to/from kernel is small
>> > compared to all the crypto operations surrounding this.
>> >
>> Right, getting rid of the extra crypto operations and so called "SSL
>> inspection" is the ultimate goal this is going towards.
>
> Yep but in order to take decisions at L7 you need to decapsulate SSL.
>
Decapsulate or decrypt? There's a big difference... :-) I am aiming to
just have to decapsulate.

>> HTTP is only one use case. There are other interesting use cases such as
>> those in container security where the application protocol might be
>> something like simple RPC.
>
> OK that indeed makes sense in such environments.
>
>> Performance is relevant because we
>> potentially want security applied to every message in every
>> communication in a containerized data center. Putting the userspace
>> hop in the datapath of every packet is known to be problematic, not
>> just for the performance hit but also because it increases the attack
>> surface on users' privacy.
>
> While I totally agree on the performance hit when inspecting each packet,
> I fail to see the relation with users' privacy. In fact under some
> circumstances it can even be the opposite. For example, using something
> like kTLS for a TCP/HTTP proxy can result in cleartext being observable
> in strace while it's not visible when TLS is terminated in userland because
> all you see are openssl's read()/write() operations. Maybe you have specific
> attacks in mind?
>
No, just the normal problem of making yet one more tool systematically
have access to user data.

>> > Regarding kernel-side protocol parsing, there's an unfortunate trend
>> > of moving more and more protocols to userland due to these protocols
>> > evolving very quickly. At least you'll want to find a way to provide
>> > these parsers from userspace, which will inevitably come with its set
>> > of problems or limitations :-/
>> >
>> That's why everything is going BPF now ;-)
>
> Yes, I knew you were going to suggest this :-) I'm still prudent on it
> to be honest. I don't think it would be that easy to implement an HPACK
> encoder/decoder using BPF. And even regarding just plain HTTP parsing,
> certain very small operations in haproxy's parser can quickly result in
> a 10% performance degradation when improperly optimized (e.g. changing a
> "likely", altering branch prediction, or cache walk patterns when using
> arrays to evaluate character classes faster). But for general usage I
> indeed think it should be OK.
>
HTTP might qualify as a special case, and I believe there's already been
some work to put http in kernel by Alexander Krizhanovsky and others. In
this case maybe http parse could be front end before BPF. Although,
pretty clear we'll need regex in BPF if we want to use it with http.

Tom
Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
On Thu, Jun 29, 2017 at 3:04 PM, Thomas Graf wrote:
> Hi Tom
>
> On 29 June 2017 at 11:27, Tom Herbert wrote:
>> This is raw, minimally tested, and error handling needs work. Posting
>> as RFC to get feedback on the design...
>>
>> Sidecar proxies are becoming quite popular on servers as a means to
>> perform layer 7 processing on application data as it is sent. Such
>> sidecars are used for SSL proxies, application firewalls, and L7
>> load balancers. While these proxies provide nice functionality,
>> their performance is obviously terrible since all the data needs
>> to take an extra hop through userspace.
>
Hi Thomas,

> I really appreciate this work. It would have been nice to at least
> attribute me in some way as this is exactly what I presented at
> Netconf 2017 [0].
>
Sure, will do that!

> I'm also wondering why this is not built on top of KCM which you
> suggested to use when we discussed this.
>
I think the main part of that discussion was around stream parser
which is needed for message delineation. For a 1:1 proxy, KCM is
probably overkill (the whole KCM data path and lock become
superfluous). Also, there's no concept of creating a whole message
before routing it, in the 1:1 case we should let the message pass
through once it's cleared by the filter (this is the strparser change
I referred to). As I mentioned, for L7 load balancing we would want a
multiplexor probably also M:N, but the structure is different since
there's still no user facing sockets, they're all TCP for instance.
IMO, the 1:1 proxy case is compelling to solve in itself...

Tom

> [0]
> https://docs.google.com/presentation/d/1dwSKSBGpUHD3WO5xxzZWj8awV_-xL-oYhvqQMOBhhtk/edit#slide=id.g203aae413f_0_0
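
As an illustration of what "message delineation" means here, below is a
small userspace sketch in C. It assumes a made-up framing (a 4-byte
big-endian length followed by the payload); the names next_msg_len(),
HDR_LEN and MAX_PAYLOAD are hypothetical and nothing here is taken from
the kproxy patches. The return convention loosely follows the shape of a
strparser parse callback as documented in the kernel (message length when
known, 0 when more bytes are needed, negative on error).

```c
#include <stddef.h>
#include <stdint.h>

#define HDR_LEN     4u          /* hypothetical framing: 4-byte big-endian length */
#define MAX_PAYLOAD (1u << 20)  /* hypothetical sanity limit on the payload size */

/*
 * Return the total length (header + payload) of the first message in buf,
 * 0 if more bytes are needed before the header can be read, or -1 if the
 * advertised payload is larger than MAX_PAYLOAD.
 */
static long next_msg_len(const uint8_t *buf, size_t avail)
{
    uint32_t payload;

    if (avail < HDR_LEN)
        return 0;
    payload = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
              ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
    if (payload > MAX_PAYLOAD)
        return -1;
    return (long)(HDR_LEN + payload);
}
```

Because the length is known as soon as the header has arrived, a 1:1
proxy built on such a parser could begin forwarding the payload before
all of it has been received, which is the pass-through case described
above.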
Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
Hi Tom

On 29 June 2017 at 11:27, Tom Herbert wrote:
> This is raw, minimally tested, and error handling needs work. Posting
> as RFC to get feedback on the design...
>
> Sidecar proxies are becoming quite popular on servers as a means to
> perform layer 7 processing on application data as it is sent. Such
> sidecars are used for SSL proxies, application firewalls, and L7
> load balancers. While these proxies provide nice functionality,
> their performance is obviously terrible since all the data needs
> to take an extra hop through userspace.

I really appreciate this work. It would have been nice to at least
attribute me in some way as this is exactly what I presented at
Netconf 2017 [0].

I'm also wondering why this is not built on top of KCM which you
suggested to use when we discussed this.

[0] https://docs.google.com/presentation/d/1dwSKSBGpUHD3WO5xxzZWj8awV_-xL-oYhvqQMOBhhtk/edit#slide=id.g203aae413f_0_0
Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
On Thu, Jun 29, 2017 at 01:40:26PM -0700, Tom Herbert wrote:
> > In fact that's not quite what I observe in the field. In practice, large
> > data streams are cheaply relayed using splice(); I could achieve
> > 60 Gbps of HTTP forwarding via HAProxy on a 4-core Xeon 2 years ago.
> > And when you use SSL, the cost of the copy to/from kernel is small
> > compared to all the crypto operations surrounding this.
> >
> Right, getting rid of the extra crypto operations and so called "SSL
> inspection" is the ultimate goal this is going towards.

Yep but in order to take decisions at L7 you need to decapsulate SSL.

> HTTP is only one use case. There are other interesting use cases such as
> those in container security where the application protocol might be
> something like simple RPC.

OK that indeed makes sense in such environments.

> Performance is relevant because we
> potentially want security applied to every message in every
> communication in a containerized data center. Putting the userspace
> hop in the datapath of every packet is known to be problematic, not
> just for the performance hit but also because it increases the attack
> surface on users' privacy.

While I totally agree on the performance hit when inspecting each packet,
I fail to see the relation with users' privacy. In fact under some
circumstances it can even be the opposite. For example, using something
like kTLS for a TCP/HTTP proxy can result in cleartext being observable
in strace while it's not visible when TLS is terminated in userland because
all you see are openssl's read()/write() operations. Maybe you have specific
attacks in mind?

> > Regarding kernel-side protocol parsing, there's an unfortunate trend
> > of moving more and more protocols to userland due to these protocols
> > evolving very quickly. At least you'll want to find a way to provide
> > these parsers from userspace, which will inevitably come with its set
> > of problems or limitations :-/
> >
> That's why everything is going BPF now ;-)

Yes, I knew you were going to suggest this :-) I'm still prudent on it
to be honest. I don't think it would be that easy to implement an HPACK
encoder/decoder using BPF. And even regarding just plain HTTP parsing,
certain very small operations in haproxy's parser can quickly result in
a 10% performance degradation when improperly optimized (e.g. changing a
"likely", altering branch prediction, or cache walk patterns when using
arrays to evaluate character classes faster). But for general usage I
indeed think it should be OK.

> > All this to say that while I can definitely imagine the benefits of
> > having in-kernel sockets for in-kernel L7 processing or filtering,
> > I'm having strong doubts about the benefits that userland may receive
> > by using this (or maybe you already have any performance numbers
> > supporting this?).
> >
> Nope, no numbers yet.

OK, no worries. Thanks for your explanations!

Willy
Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
Hi Willy,

Thanks for the comments!

> In fact that's not quite what I observe in the field. In practice, large
> data streams are cheaply relayed using splice(); I could achieve
> 60 Gbps of HTTP forwarding via HAProxy on a 4-core Xeon 2 years ago.
> And when you use SSL, the cost of the copy to/from kernel is small
> compared to all the crypto operations surrounding this.
>
Right, getting rid of the extra crypto operations and so called "SSL
inspection" is the ultimate goal this is going towards.

> Another point is that most HTTP requests are quite small (typically ~80%
> 20kB or less), and in this case the L7 processing and certain syscalls
> significantly dominate the operations, data copies are comparatively
> small. Simply parsing an HTTP header takes time (when you do it correctly).
> You can hardly parse and index more than 800MB-1GB/s of HTTP headers
> per core, which limits you to roughly 1-1.2 M req+resp per second for
> a 400 byte request and a 400 byte response, and that's without any
> processing at all. But when doing this, certain syscalls like connect(),
> close() or epoll_ctl() start to be quite expensive. Even splice() is
> expensive to forward small data chunks because you need two calls, and
> recv+send is faster. In fact our TCP stack has been so much optimized
> for realistic workloads over the years that it becomes hard to gain
> more by cheating on it :-)
>
> In the end in haproxy I'm seeing about 300k req+resp per second in
> HTTP keep-alive and more like 100-130k with close, when disabling
> TCP quick-ack during accept() and connect() to save one ACK on each
> side (just doing this generally brings performance gains between 7
> and 10%).
>
HTTP is only one use case. There are other interesting use cases such as
those in container security where the application protocol might be
something like simple RPC. Performance is relevant because we
potentially want security applied to every message in every
communication in a containerized data center. Putting the userspace
hop in the datapath of every packet is known to be problematic, not
just for the performance hit but also because it increases the attack
surface on users' privacy.

> Regarding kernel-side protocol parsing, there's an unfortunate trend
> of moving more and more protocols to userland due to these protocols
> evolving very quickly. At least you'll want to find a way to provide
> these parsers from userspace, which will inevitably come with its set
> of problems or limitations :-/
>
That's why everything is going BPF now ;-)

> All this to say that while I can definitely imagine the benefits of
> having in-kernel sockets for in-kernel L7 processing or filtering,
> I'm having strong doubts about the benefits that userland may receive
> by using this (or maybe you already have any performance numbers
> supporting this?).
>
Nope, no numbers yet.

> Just my two cents,
> Willy
Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
Hi Tom,

On Thu, Jun 29, 2017 at 11:27:03AM -0700, Tom Herbert wrote:
> Sidecar proxies are becoming quite popular on servers as a means to
> perform layer 7 processing on application data as it is sent. Such
> sidecars are used for SSL proxies, application firewalls, and L7
> load balancers. While these proxies provide nice functionality,
> their performance is obviously terrible since all the data needs
> to take an extra hop through userspace.
>
> Consider transmitting data on a TCP socket that goes through a
> sidecar proxy. The application does a sendmsg in userspace, data
> goes into kernel, back to userspace, and back to kernel. That is two
> trips through TCP TX, one TCP RX, potentially three copies, three
> sockets are touched, and three context switches. Using a proxy in the
> receive path would have a similarly long path.
>
>     +--------------+      +------------------+
>     | Application  |      | Proxy            |
>     |              |      |                  |
>     |   sendmsg    |      | recvmsg sendmsg  |
>     +--------------+      +------------------+
>           |                    |       |
>           |                    ^       |
>     ------V--------------------|-------|--------------
>           |                    |       |
>           +---->--------->-----+       V
>         TCP TX              TCP RX   TCP TX
>
> The "boomerang" model this employs is quite expensive. This is
> even worse in the case that the proxy is an SSL proxy (e.g.
> performing SSL inspection to implement an application firewall).

In fact that's not quite what I observe in the field. In practice, large
data streams are cheaply relayed using splice(); I could achieve
60 Gbps of HTTP forwarding via HAProxy on a 4-core Xeon 2 years ago.
And when you use SSL, the cost of the copy to/from kernel is small
compared to all the crypto operations surrounding this.

Another point is that most HTTP requests are quite small (typically ~80%
20kB or less), and in this case the L7 processing and certain syscalls
significantly dominate the operations, data copies are comparatively
small. Simply parsing an HTTP header takes time (when you do it correctly).
You can hardly parse and index more than 800MB-1GB/s of HTTP headers
per core, which limits you to roughly 1-1.2 M req+resp per second for
a 400 byte request and a 400 byte response, and that's without any
processing at all. But when doing this, certain syscalls like connect(),
close() or epoll_ctl() start to be quite expensive. Even splice() is
expensive to forward small data chunks because you need two calls, and
recv+send is faster. In fact our TCP stack has been so much optimized
for realistic workloads over the years that it becomes hard to gain
more by cheating on it :-)

In the end in haproxy I'm seeing about 300k req+resp per second in
HTTP keep-alive and more like 100-130k with close, when disabling
TCP quick-ack during accept() and connect() to save one ACK on each
side (just doing this generally brings performance gains between 7
and 10%).

Regarding kernel-side protocol parsing, there's an unfortunate trend
of moving more and more protocols to userland due to these protocols
evolving very quickly. At least you'll want to find a way to provide
these parsers from userspace, which will inevitably come with its set
of problems or limitations :-/

All this to say that while I can definitely imagine the benefits of
having in-kernel sockets for in-kernel L7 processing or filtering,
I'm having strong doubts about the benefits that userland may receive
by using this (or maybe you already have any performance numbers
supporting this?).

Just my two cents,
Willy
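
For readers unfamiliar with the splice() relay mentioned above, here is a
minimal sketch in C of why it takes two calls per chunk: data has to go
from the source into a pipe, and then from the pipe to the destination.
This is not haproxy's code; splice_relay() and its parameters are
hypothetical names, error handling is reduced to the bare minimum, and
non-blocking descriptors are not handled. In a proxy, in_fd and out_fd
would be the two TCP sockets.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Move up to len bytes from in_fd to out_fd through the pipe pfd[2]
 * without copying the data into a userspace buffer. Returns the number
 * of bytes forwarded, 0 on end of stream, or -1 on error.
 */
static ssize_t splice_relay(int in_fd, int pfd[2], int out_fd, size_t len)
{
    ssize_t in, out, done = 0;

    /* First call: source -> pipe. */
    in = splice(in_fd, NULL, pfd[1], NULL, len, SPLICE_F_MOVE);
    if (in <= 0)
        return in;

    /* Second call (repeated on short writes): pipe -> destination. */
    while (done < in) {
        out = splice(pfd[0], NULL, out_fd, NULL, (size_t)(in - done),
                     SPLICE_F_MOVE);
        if (out <= 0)
            return -1;
        done += out;
    }
    return done;
}
```

For large chunks the two system calls are amortized over a lot of data;
for small ones they are pure overhead, which matches the remark that
recv+send is faster in that case.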
[PATCH RFC 0/2] kproxy: Kernel Proxy
This is raw, minimally tested, and error handling needs work. Posting
as RFC to get feedback on the design...

Sidecar proxies are becoming quite popular on servers as a means to
perform layer 7 processing on application data as it is sent. Such
sidecars are used for SSL proxies, application firewalls, and L7
load balancers. While these proxies provide nice functionality,
their performance is obviously terrible since all the data needs
to take an extra hop through userspace.

Consider transmitting data on a TCP socket that goes through a
sidecar proxy. The application does a sendmsg in userspace, data
goes into kernel, back to userspace, and back to kernel. That is two
trips through TCP TX, one TCP RX, potentially three copies, three
sockets are touched, and three context switches. Using a proxy in the
receive path would have a similarly long path.

    +--------------+      +------------------+
    | Application  |      | Proxy            |
    |              |      |                  |
    |   sendmsg    |      | recvmsg sendmsg  |
    +--------------+      +------------------+
          |                    |       |
          |                    ^       |
    ------V--------------------|-------|--------------
          |                    |       |
          +---->--------->-----+       V
        TCP TX              TCP RX   TCP TX

The "boomerang" model this employs is quite expensive. This is
even worse in the case that the proxy is an SSL proxy (e.g.
performing SSL inspection to implement an application firewall).
In this case the application encrypts using TLS, the proxy
immediately decrypts (it knows the key by virtue of having pretended
to be a certificate authority). Subsequently, the proxy re-encrypts
the data to send. So each byte undergoes three crypto operations in
this path!

This patch set creates an in-kernel proxy (kproxy). The concept is
fairly straightforward: two sockets are joined in the kernel as a
proxy. Proxy functionality will be done by BPF on the data stream;
kTLS is needed to make an SSL proxy. The most prominent ULP for a
proxy is http, so we'll need a parser for http to make an in-kernel
http proxy.

    +--------------+
    | Application  |
    |              |
    |   sendmsg    |
    +--------------+
          |
          |
    ------V---------------------------------
          |
          |      +---------------+
          +----->| Proxy         |---+
                 | strparser+BPF |   |
                 +---------------+   |
        TCP TX       TCP RX          |
                                     V
                                   TCP TX

This patch set implements a very rudimentary kernel proxy; it just
provides an interface to create a proxy between two sockets. Once
the RX and TX paths are done for kTLS it should be straightforward
to make an in-kernel SSL proxy. Proxy functionality (like
application level filtering) will be implemented by BPF programs
set on the kproxy. This will use strparser to provide message
delineation (we'll need a slight modification to strparser to allow
pass through mode).

In-kernel layer 7 load balancing is also feasible; in that case we
may want to use a multiplexor structure like KCM (I had considered
overloading KCM for kproxy but decided they are too different).

kproxy eliminates the userspace boomerang, but you may notice that
even with kproxy we still have the same number of sockets and still
potentially perform three crypto ops on every byte. I have some ideas
for how to create a "zero proxy" that eliminates these without loss
of the proxy functionality. That would be the subject of a future
patch set.

Tom Herbert (2):
  skbuff: Function to send an skbuf on a socket
  kproxy: Kernel proxy

 include/linux/skbuff.h              |   2 +
 include/linux/socket.h              |   4 +-
 include/net/kproxy.h                |  80 +
 include/uapi/linux/kproxy.h         |  30 ++
 net/Kconfig                         |   1 +
 net/Makefile                        |   1 +
 net/core/skbuff.c                   |  66
 net/kproxy/Kconfig                  |  10 +
 net/kproxy/Makefile                 |   3 +
 net/kproxy/kproxyproc.c             | 246 +++
 net/kproxy/kproxysock.c             | 605
 security/selinux/hooks.c            |   4 +-
 security/selinux/include/classmap.h |   4 +-
 13 files changed, 1053 insertions(+), 3 deletions(-)
 create mode 100644 include/net/kproxy.h
 create mode 100644 include/uapi/linux/kproxy.h
 create mode 100644 net/kproxy/Kconfig
 create mode 100644 net/kproxy/Makefile
 create mode 100644 net/kproxy/kproxyproc.c
 create mode 100644 net/kproxy/kproxysock.c

-- 
2.7.4
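
For comparison, the userspace "boomerang" that kproxy is meant to remove
can be sketched as a plain recv()/send() loop body. This is only an
illustration in C of the two copies per chunk described in the cover
letter, not code from the patch set; relay_once() and the 16 KB buffer
size are assumptions made for the example.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * One iteration of a userspace relay: copy data from the kernel into a
 * userspace buffer with recv(), then copy it back into the kernel with
 * send(). Returns bytes relayed, 0 on orderly shutdown, -1 on error.
 */
static ssize_t relay_once(int from_fd, int to_fd)
{
    char buf[16384];
    ssize_t n, sent, off = 0;

    n = recv(from_fd, buf, sizeof(buf), 0);    /* kernel -> userspace copy */
    if (n <= 0)
        return n;

    while (off < n) {                          /* userspace -> kernel copy */
        sent = send(to_fd, buf + off, (size_t)(n - off), 0);
        if (sent < 0)
            return -1;
        off += sent;
    }
    return n;
}
```

A sidecar proxy would call something like this in a loop for each
direction; every chunk crosses the user/kernel boundary twice, on top of
the application's own sendmsg.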