Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag

2016-06-21 Thread Or Gerlitz

On 6/21/2016 6:18 PM, Eric Dumazet wrote:

One solution would be to setup a special netdev used only for sniffers
(No IP address on it)

-> Only changes would happen in the driver, to set skb->dev to this
'debug' device.


Eric,

Yep, that was an option too, but when we realized that libpcap has the 
means to add plug-in for non-netdevices (e.g usb, can devices and now we 
are thinking to add one for uverbs),  it means we can avoid adding tons 
of pretty complex code into the kernel driver and happily have simpler 
code is user-space, so... why not? will try that 1st.


Or.



Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag

2016-06-21 Thread Eric Dumazet
On Tue, Jun 21, 2016 at 6:04 AM, Saeed Mahameed
 wrote:

>
> Alexei , I don't understand your concern.
> We already have a full/complete working dpdk bypass solution in
> userspace nothing extra is required from the kernel.
>
> We just want to see this traffic and any other rdma traffic in tcpdump
> or other standard sniffing tools.
>
> Anyway we brainstormed this internally today and we don't like the
> "skb->protocol = 0x" solution,

One solution would be to setup a special netdev used only for sniffers
(No IP address on it)

-> Only changes would happen in the driver, to set skb->dev to this
'debug' device.


Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag

2016-06-21 Thread Saeed Mahameed
On Tue, Jun 21, 2016 at 5:18 AM, Alexei Starovoitov
 wrote:
> On Sat, Jun 18, 2016 at 01:31:26AM +0300, Saeed Mahameed wrote:
>>
>> We simply want to selectively be able to see RoCE/RDMA ETH standard
>> traffic in tcpdump, for diagnostic purposes.
>> so in order to not overwhelm the kernel TCP/IP stack with this
>> traffic, this patch in particular
>> configures ConnectX4 hardware to tag those packets, so in downstream
>> patches mlx5 netdevice will mark the SKBs of those packets
>> to skip the TCP/IP stack and go only to tcpdump.
>
> such 'feature' doesn't make sense to me.
> 'not overwhelming' kernel, but to 'overwhelm' userspace tcpdump?
> Kernel can drop packets way faster than userspace, so such bypass
> scheme has no prartical usage other than building a first step towards
> complete dpdk-like bypass.
>

Alexei , I don't understand your concern.
We already have a full/complete working dpdk bypass solution in
userspace nothing extra is required from the kernel.

We just want to see this traffic and any other rdma traffic in tcpdump
or other standard sniffing tools.

Anyway we brainstormed this internally today and we don't like the
"skb->protocol = 0x" solution,
we will suggest a plugin for libpcap in user space to extend libpcap
ability to sniff RDMA/raw eth traffic.
This way userspace RDMA traffic will be sniffed also via a userspace
RDMA channel.

I will ask Dave to drop this series.


Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag

2016-06-20 Thread Alexei Starovoitov
On Sat, Jun 18, 2016 at 01:31:26AM +0300, Saeed Mahameed wrote:
> 
> We simply want to selectively be able to see RoCE/RDMA ETH standard
> traffic in tcpdump, for diagnostic purposes.
> so in order to not overwhelm the kernel TCP/IP stack with this
> traffic, this patch in particular
> configures ConnectX4 hardware to tag those packets, so in downstream
> patches mlx5 netdevice will mark the SKBs of those packets
> to skip the TCP/IP stack and go only to tcpdump.

such 'feature' doesn't make sense to me.
'not overwhelming' kernel, but to 'overwhelm' userspace tcpdump?
Kernel can drop packets way faster than userspace, so such bypass
scheme has no prartical usage other than building a first step towards
complete dpdk-like bypass.



Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag

2016-06-19 Thread Saeed Mahameed
On Sat, Jun 18, 2016 at 2:34 AM, Eric Dumazet  wrote:
> On Fri, Jun 17, 2016 at 3:31 PM, Saeed Mahameed
>  wrote:
>>
>> Today there are some bad usages and abuse to skb->protocol where some
>> device drivers set skb->protocol = 0xff to skip the kernel TCP/IP
>> processing for the same diagnostic purposes.
>> In this series we are just trying to do the right thing.
>
> Well, your patch adds an extra test in kernel fast path, just to ease
> the life of people using kernel bypass,
> but willing to use tcpdump because they can not figure to do this in
> user space properly.
>
> Please find a way  _not_ adding a single instruction in kernel fast path.
>

Well, we can set skb->protocol = 0x.
what do you think ?


Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag

2016-06-17 Thread Eric Dumazet
On Fri, Jun 17, 2016 at 3:31 PM, Saeed Mahameed
 wrote:
> On Fri, Jun 17, 2016 at 7:00 PM, Alexei Starovoitov
>  wrote:
>> On Fri, Jun 17, 2016 at 05:43:53PM +0300, Saeed Mahameed wrote:
>>> From: Maor Gottlieb 
>>>
>>> Add kernel offload flow tag for packets that will bypass the kernel
>>> stack, e.g (RoCE/RDMA/RAW ETH (DPDK), etc ..).
>>
>> so the whole series is an elaborate way to enable dpdk? how nice.
>
> NO, God forbids! the whole series has nothing to do with dpdk!
> Please read the cover letter.
>
> Quoting my own words from the cover letter:
> "This patch set introduces mlx5 RoCE/RDMA packet sniffer, it allows
> mlx5e netdevice to receive RoCE/RDMA or RAW ETH traffic which isn't
> supposed to be passed to the kernel stack, for sniffing and diagnostics
> purposes."
>
> We simply want to selectively be able to see RoCE/RDMA ETH standard
> traffic in tcpdump, for diagnostic purposes.
> so in order to not overwhelm the kernel TCP/IP stack with this
> traffic, this patch in particular
> configures ConnectX4 hardware to tag those packets, so in downstream
> patches mlx5 netdevice will mark the SKBs of those packets
> to skip the TCP/IP stack and go only to tcpdump.
>
> DPDK is not enabled/disabled or even slightly affected in this series.
> It was just given as an example in this patch commit message for
> traffic that can be sniffed in standard tools such as tcpdump.
>
> Today there are some bad usages and abuse to skb->protocol where some
> device drivers set skb->protocol = 0xff to skip the kernel TCP/IP
> processing for the same diagnostic purposes.
> In this series we are just trying to do the right thing.

Well, your patch adds an extra test in kernel fast path, just to ease
the life of people using kernel bypass,
but willing to use tcpdump because they can not figure to do this in
user space properly.

Please find a way  _not_ adding a single instruction in kernel fast path.

Thanks.


Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag

2016-06-17 Thread Saeed Mahameed
On Fri, Jun 17, 2016 at 7:00 PM, Alexei Starovoitov
 wrote:
> On Fri, Jun 17, 2016 at 05:43:53PM +0300, Saeed Mahameed wrote:
>> From: Maor Gottlieb 
>>
>> Add kernel offload flow tag for packets that will bypass the kernel
>> stack, e.g (RoCE/RDMA/RAW ETH (DPDK), etc ..).
>
> so the whole series is an elaborate way to enable dpdk? how nice.

NO, God forbids! the whole series has nothing to do with dpdk!
Please read the cover letter.

Quoting my own words from the cover letter:
"This patch set introduces mlx5 RoCE/RDMA packet sniffer, it allows
mlx5e netdevice to receive RoCE/RDMA or RAW ETH traffic which isn't
supposed to be passed to the kernel stack, for sniffing and diagnostics
purposes."

We simply want to selectively be able to see RoCE/RDMA ETH standard
traffic in tcpdump, for diagnostic purposes.
so in order to not overwhelm the kernel TCP/IP stack with this
traffic, this patch in particular
configures ConnectX4 hardware to tag those packets, so in downstream
patches mlx5 netdevice will mark the SKBs of those packets
to skip the TCP/IP stack and go only to tcpdump.

DPDK is not enabled/disabled or even slightly affected in this series.
It was just given as an example in this patch commit message for
traffic that can be sniffed in standard tools such as tcpdump.

Today there are some bad usages and abuse to skb->protocol where some
device drivers set skb->protocol = 0xff to skip the kernel TCP/IP
processing for the same diagnostic purposes.
In this series we are just trying to do the right thing.


Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag

2016-06-17 Thread John Fastabend
On 16-06-17 09:00 AM, Alexei Starovoitov wrote:
> On Fri, Jun 17, 2016 at 05:43:53PM +0300, Saeed Mahameed wrote:
>> From: Maor Gottlieb 
>>
>> Add kernel offload flow tag for packets that will bypass the kernel
>> stack, e.g (RoCE/RDMA/RAW ETH (DPDK), etc ..).
> 
> so the whole series is an elaborate way to enable dpdk? how nice.
> NACK.
> 

Well there is certainly room for augmenting the af_packet interfac with
hardware support.

Some things on my list (its a bit behind a few other things though) is
to align queues to af_packet sockets, put those queues in busy poll and
if possible look at zero copy RX. The problem with zero copy rx is it
bypasses the stack but we should be able to detect hooks being added on
ingress and disable it dynamically. Maybe I could look at this in a few
months but think about it for me I'm a bit busy lately. Also it requires
the driver to translate descriptor formats but I'm not convinced it is
that costly to do.

For DPDK why not just use SR-IOV like everyone else and bind a VF to
your favorite user space implementation (DPDK, NETMAP, PFRING, foobar0)
even Windows if you like. DPDK even supports this as far as I know.
Then you don't need to bother kernel folks at all. And you don't have
any overhead except from whatever your usermode code does.

Here's a really old patch I wrote that I would like to revisit at some
point,

---

>> This adds ndo ops for upper layer objects to request direct DMA from
>> the network interface into memory "slots". The slots must be DMA'able
>> memory given by a page/offset/size vector in a packet_ring_buffer
>> structure.
>>
>> The PF_PACKET socket interface can use these ndo_ops to do zerocopy
>> RX from the network device into memory mapped userspace memory. For
>> this to work ixgbe encodes the correct descriptor blocks and headers
>> so that existing PF_PACKET applications work without any modification.
>> This only supports the V2 header formats. And works by mapping a ring
>> of the network device to these slots.
>>
>> V3 header formats added bulk polling via socket calls and timers
>> used in the polling interface to return every n milliseconds. Currently,
>> I don't see any way to support this in hardware because we can't
>> know if the hardware is in the middle of a DMA operation or not
>> on a slot. So when a timer fires I don't know how to advance the
>> descriptor ring leaving empty descriptors similar to how the software
>> ring works. The easiest (best?) route is to simply not support this.
>>
>> The ndo operations and new socket option PACKET_RX_DIRECT work by
>> selecting a queue_index to run the direct dma operations over. Once
>> setsockopt returns sucessfully the indicated queue is mapped
>> directly to the requesting application and can not be used for
>> other purposes. Also any kernel layers such as BPF will be bypassed
>> and need to be implemented in the hardware via some other mechanism
>> such as flow classifier or future offload interfaces.
>>
>> Users steer traffic to the selected queue using flow director or
>> VMDQ. This needs to implemented through the ioctl flow classifier
>> interface (ethtool) or macvlan+hardware offload via netlink the
>> command line tool ip also supports macvlan+hardware_offload.
>>
>> The new socket option added to PF_PACKET is called PACKET_RX_DIRECT.
>> It takes a single unsigned int value specifying the queue index,
>>
>>  setsockopt(sock, SOL_PACKET, PACKET_RX_DIRECT,
>> &queue_index, sizeof(queue_index));
>>
>> To test this I modified the tool psock_tpacket in the selftests
>> kernel directory here:
>>
>>  ./tools/testing/selftests/net/psock_tpacket.c
>>
>> Running this tool opens a socket and listend for packets over
>> the PACKET_RX_DIRECT enabled socket.
>>
>> One more caveat is the ring size of ixgbe and the size used by
>> the software socket need to be the same. There is currently an
>> ethtool to configure this and a warnding is pushed to dmesg when
>> it is not the case. To set use an ioctl directly or call
>>
>>  ethtool -G ethx rx 
>>
>> where  is the number of configured slots. The demo program
>> psock_tpacket needs size=2400.
>>
>> Known Limitations (TBD):
>>
>>  (1) Users are required to match the number of rx ring
>>  slots with ethtool to the number requested by the
>>  setsockopt PF_PACKET layout. In the future we could
>>  possibly do this automatically. More importantly this
>>  setting is currently global for all rings and needs
>>  to be per queue.
>>
>>  (2) Users need to configure Flow director or VMDQ (macvlan)
>>  to steer traffic to the correct queues. I don't believe
>>  this needs to be changed it seems to be a good mechanism
>>  for driving ddma.
>>
>>  (3) Not supporting timestamps or priv space yet
>>
>>  (4) Not supporting BPF yet...
>>
>>  (5) Only RX supported so far. TX already supports direct DMA
>>  interface but uses skbs

Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag

2016-06-17 Thread Alexei Starovoitov
On Fri, Jun 17, 2016 at 05:43:53PM +0300, Saeed Mahameed wrote:
> From: Maor Gottlieb 
> 
> Add kernel offload flow tag for packets that will bypass the kernel
> stack, e.g (RoCE/RDMA/RAW ETH (DPDK), etc ..).

so the whole series is an elaborate way to enable dpdk? how nice.
NACK.