RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-05-19 Thread Jon Rosen (jrosen)
Forward link to a new proposed patch at:
https://www.mail-archive.com/netdev@vger.kernel.org/msg236629.html



RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-04 Thread Jon Rosen (jrosen)
> >> >One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
> >> >is that the documentation of the tp_status field is somewhat
> >> >inconsistent.  In some places it's described as TP_STATUS_KERNEL(0)
> >> >meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
> >> >meaning the entry is owned by user space.  In other places ownership
> >> >by user space is defined by the TP_STATUS_USER(1) bit being set.
> >>
> >> But indeed this example in packet_mmap.txt is problematic
> >>
> >> if (status == TP_STATUS_KERNEL)
> >> retval = poll(, 1, timeout);
> >>
> >> It does not really matter whether the docs are possibly inconsistent and
> >> which one is authoritative. Examples like the above make it likely that
> >> some user code expects such code to work.
> >
> > Yes, that's exactly my concern.  Yet another troubling example seems to be
> > lipbcap which also is looking specifically for status to be anything other 
> > than
> > TP_STATUS_KERNEL(0) to indicate a frame is available in user space.
> 
> Good catch. If pcap-linux.c relies on this then the status field
> cannot be changed. Other fields can be modified freely while tp_status
> remains 0, perhaps that's an option.

Possibly. Someone else suggested something similar but in at least the
one example we thought through it still seemed like it didn't address the 
problem.

For example, let's say we used tp_len == -1 to indicate to other kernel threads
that the entry was already in progress.  This would require that user space 
never
set tp_len = -1 before returning the entry back to the kernel.  If it did then 
no
kernel thread would ever claim ownership and the ring would hang.

Now, it seems pretty unlikely that user space would do such a thing so maybe we
could look past that, but then we run into the issue that there is still a 
window
of opportunity for other kernel threads to come in and wrap the ring.

The reason is we can't set tp_len to the correct length after setting tp_status 
because
user space could grab the entry and see tp_len == -1 so we have to set tp_len
before we set tp_status. This means that there is still a window where other
kernel threads could come in and see tp_len as something other than -1 and
a tp_status of TP_STATUS_KERNEL and think it's ok to allocate the entry.
This puts us back to where we are today (arguably with a smaller window,
but a window none the less).

Alternatively we could reacquire the spin_lock to then set tp_len followed by
tp_status.  This would give the necessary indivisibility in the kernel while 
preserving proper order as made visible to user space, but it comes at the cost
of another spin_lock.

Thanks for the suggestion.  If you can think of a way around this I'm all ears.
I'll think on this some more but so far I'm stuck on how to get past having to
broaden the scope of the spin_lock, reacquire the spin_lock, or use some sort
of atomic construct along with a parallel shadow ring structure (still thinking
through that one as well).



Re: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-04 Thread Willem de Bruijn
>> >One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
>> >is that the documentation of the tp_status field is somewhat
>> >inconsistent.  In some places it's described as TP_STATUS_KERNEL(0)
>> >meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
>> >meaning the entry is owned by user space.  In other places ownership
>> >by user space is defined by the TP_STATUS_USER(1) bit being set.
>>
>> But indeed this example in packet_mmap.txt is problematic
>>
>> if (status == TP_STATUS_KERNEL)
>> retval = poll(, 1, timeout);
>>
>> It does not really matter whether the docs are possibly inconsistent and
>> which one is authoritative. Examples like the above make it likely that
>> some user code expects such code to work.
>
> Yes, that's exactly my concern.  Yet another troubling example seems to be
> lipbcap which also is looking specifically for status to be anything other 
> than
> TP_STATUS_KERNEL(0) to indicate a frame is available in user space.

Good catch. If pcap-linux.c relies on this then the status field
cannot be changed. Other fields can be modified freely while tp_status
remains 0, perhaps that's an option.


RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-04 Thread Jon Rosen (jrosen)
On Wednesday, April 04, 2018 9:49 AM, Willem de Bruijn  
wrote:
> 
> On Tue, Apr 3, 2018 at 11:55 PM, Jon Rosen  wrote:
> > Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
> > casues the ring to get corrupted by allowing multiple kernel threads
> > to claim ownership of the same ring entry, Mark the ring entry as
> > already being used within the spin_lock to prevent other kernel
> > threads from reusing the same entry before it's fully filled in,
> > passed to user space, and then eventually passed back to the kernel
> > for use with a new packet.
> >
> > Note that the proposed change may modify the semantics of the
> > interface between kernel space and user space in a way which may cause
> > some applications to no longer work properly.
> 
> As long as TP_STATUS_USER (1) is not set, userspace should ignore
> the slot..
> 
> >One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
> >is that the documentation of the tp_status field is somewhat
> >inconsistent.  In some places it's described as TP_STATUS_KERNEL(0)
> >meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
> >meaning the entry is owned by user space.  In other places ownership
> >by user space is defined by the TP_STATUS_USER(1) bit being set.
> 
> But indeed this example in packet_mmap.txt is problematic
> 
> if (status == TP_STATUS_KERNEL)
> retval = poll(, 1, timeout);
> 
> It does not really matter whether the docs are possibly inconsistent and
> which one is authoritative. Examples like the above make it likely that
> some user code expects such code to work.

Yes, that's exactly my concern.  Yet another troubling example seems to be
lipbcap which also is looking specifically for status to be anything other than
TP_STATUS_KERNEL(0) to indicate a frame is available in user space.

Either way things are broken. They are broken as they stand now because the
ring can get overrun and the kernel and user space tracking of the ring can
get out of sync.  And they are broken with the below change because some user
space applications will be looking for anything other than TP_STATUS_KERNEL,
so again the ring will get out of sync.

The difference here being that the way it is today is on average (across all 
environments
and across all user space apps) less likely to occur while with the change 
below it is
much more likely to occur.

Maybe the right answer here is to implement a fix that is compatible for 
existing
applications and accept any potential performance impacts and then add yet 
another
version (TPACKET_V4?) which more strictly requires the TP_STATUS_USER bit for
passing ownership.

> 
> > +++ b/net/packet/af_packet.c
> > @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
> > net_device *dev,
> > if (po->stats.stats1.tp_drops)
> > status |= TP_STATUS_LOSING;
> > }
> > +
> > +/*
> > + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other
> > + * kernel threads from re-using this same entry.
> > + */
> > +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING
> 
> No need to reinterpret existing flags. tp_status is a u32 with
> sufficient undefined bits.

Agreed.

> 
> > +   if (po->tp_version <= TPACKET_V2)
> > +__packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> > +
> > po->stats.stats1.tp_packets++;
> > if (copy_skb) {
> > status |= TP_STATUS_COPY;
> > --
> > 2.10.3.dirty
> >

Thanks for the feedback!
Jon.


RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-04 Thread Jon Rosen (jrosen)


> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index e0f3f4a..264d7b2 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
> > net_device *dev,
> > if (po->stats.stats1.tp_drops)
> > status |= TP_STATUS_LOSING;
> > }
> > +
> > +/*
> > + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other
> > + * kernel threads from re-using this same entry.
> > + */
> > +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING
> > +   if (po->tp_version <= TPACKET_V2)
> > +__packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> > +
> > po->stats.stats1.tp_packets++;
> > if (copy_skb) {
> > status |= TP_STATUS_COPY;
> 
> This patch looks correct. Please resend it with proper signed-off-by
> and with a kernel code indenting style (tabs).  Is this bug present
> since the beginning of af_packet and multiqueue devices or did it get
> introduced in some previous kernel?

Sorry about the tabs, I'll fix that and try to figure out what I did wrong with
the signed-off-by.

I've looked back as far as I could find online (2.6.11) and it would appear that
this bug has always been there.

Thanks, jon.



Re: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-04 Thread Willem de Bruijn
On Tue, Apr 3, 2018 at 11:55 PM, Jon Rosen  wrote:
> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
> casues the ring to get corrupted by allowing multiple kernel threads
> to claim ownership of the same ring entry, Mark the ring entry as
> already being used within the spin_lock to prevent other kernel
> threads from reusing the same entry before it's fully filled in,
> passed to user space, and then eventually passed back to the kernel
> for use with a new packet.
>
> Note that the proposed change may modify the semantics of the
> interface between kernel space and user space in a way which may cause
> some applications to no longer work properly.

As long as TP_STATUS_USER (1) is not set, userspace should ignore
the slot..

>One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
>is that the documentation of the tp_status field is somewhat
>inconsistent.  In some places it's described as TP_STATUS_KERNEL(0)
>meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
>meaning the entry is owned by user space.  In other places ownership
>by user space is defined by the TP_STATUS_USER(1) bit being set.

But indeed this example in packet_mmap.txt is problematic

if (status == TP_STATUS_KERNEL)
retval = poll(, 1, timeout);

It does not really matter whether the docs are possibly inconsistent and
which one is authoritative. Examples like the above make it likely that
some user code expects such code to work.

> +++ b/net/packet/af_packet.c
> @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
> net_device *dev,
> if (po->stats.stats1.tp_drops)
> status |= TP_STATUS_LOSING;
> }
> +
> +/*
> + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other
> + * kernel threads from re-using this same entry.
> + */
> +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING

No need to reinterpret existing flags. tp_status is a u32 with
sufficient undefined bits.

> +   if (po->tp_version <= TPACKET_V2)
> +__packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> +
> po->stats.stats1.tp_packets++;
> if (copy_skb) {
> status |= TP_STATUS_COPY;
> --
> 2.10.3.dirty
>


Re: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-03 Thread Stephen Hemminger
On Tue,  3 Apr 2018 17:55:25 -0400
Jon Rosen  wrote:

> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
> casues the ring to get corrupted by allowing multiple kernel threads
> to claim ownership of the same ring entry, Mark the ring entry as
> already being used within the spin_lock to prevent other kernel
> threads from reusing the same entry before it's fully filled in,
> passed to user space, and then eventually passed back to the kernel
> for use with a new packet.
> 
> Note that the proposed change may modify the semantics of the
> interface between kernel space and user space in a way which may cause
> some applications to no longer work properly. More discussion on this
> change can be found in the additional comments section titled
> "3. Discussion on packet_mmap ownership semantics:".
> 
> Signed-off-by: Jon Rosen 
> ---
> 
> Additional Comments Section
> ---
> 
> 1. Description of the diffs:
> 
> 
>TPACKET_V1 and TPACKET_V2 format rings:
>---
>Mark each entry as TP_STATUS_IN_PROGRESS after allocating to
>prevent other kernel threads from re-using the same entry.
> 
>This is necessary because there may be a delay from the time the
>spin_lock is released to the time that the packet is completed and
>the corresponding ring entry is marked as owned by user space.  If
>during this time other kernel threads enqueue more packets to the
>ring than the size of the ring then it will cause mutliple kernel
>threads to operate on the same entry at the same time, corrupting
>packets and the ring state.
> 
>By marking the entry as allocated (IN_PROGRESS) we prevent other
>kernel threads from incorrectly re-using an entry that is still in
>the progress of being filled in before it is passed to user space.
> 
>This forces each entry through the following states:
> 
>+-> 1. (tp_status == TP_STATUS_KERNEL)
>|  Free: For use by any kernel thread to store a new packet
>|
>|   2. !(tp_status == TP_STATUS_KERNEL) && !(tp_status & TP_STATUS_USER)
>|  Allocated: In use by a *specific* kernel thread
>|
>|   3. (tp_status & TP_STATUS_USER)
>|  Available: Packet available for user space to process
>|
>+-- Loop back to #1 when user space writes entry as TP_STATUS_KERNEL
> 
> 
>No impact on TPACKET_V3 format rings:
>-
>Packet entry ownership is already protected from other kernel
>threads potentially re-using the same entry. This is done inside
>packet_current_rx_frame() where storage is allocated for the
>current packet. Since this is done within the spin_lock no
>additional status updates for this entry are required.
> 
> 
>Defining TP_STATUS_IN_PROGRESS:
>---
>Rather than defining a new-bit we re-use an existing bit for this
>intermediate state.  Status will eventually be overwritten with the
>actual true status when passed to user space.  Any bit used to pass
>information to user space other than the one that passes ownership
>is suitable (can't use TP_STATUS_USER).  Alternatively a new bit
>could be defined.
> 
> 
> 2. More detailed discussion:
> 
>Ring entries basically have 2 states, owned by the kernel or owned by
>user space. For single producer/single consumer this works fine. For
>multiple producers there is a window between the call to spin_unlock
>[F] and the call to __packet_set_status [J] where if there are enough
>packets added to the ring by other kernel threads then the ring can
>wrap and multiple threads will end up using the same ring entries.
> 
>This occurs because the ring entry alocated at [C] did not modify the
>state of the entry so it continues to appear as owned by the kernel
>and available for use for new packets even though it has already been
>allocated.
> 
>A simple fix is to temporarily mark the ring entries within the spin
>lock such that user space will still think it?s owned by the kernel
>and other kernel threads will not see it as available to be used for
>new packets. If a kernel thread gets delayed between [F] and [J] for
>an extended period of time and the ring wraps back to the same point
>then subsiquent kernel threads attempts to allocate will fail and be
>treated as the ring being full.
> 
>The change below at [D] uses a newly defined TP_STATUS_IN_PROGRESS bit
>to prevent other kernel threads from re-using the same entry. Note that
>any existing bit other than TP_STATUS_USER could have been used.
> 
>af_packet.c:tpacket_rcv()
>   ... code removed for brevity ...
> 
>   // Acquire spin lock
> A:spin_lock(>sk_receive_queue.lock);
> 
> // Preemption is disabled
> 
>