RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-23 Thread Jon Rosen (jrosen)


> -----Original Message-----
> From: Willem de Bruijn [mailto:willemdebruijn.ker...@gmail.com]
> Sent: Wednesday, May 23, 2018 9:37 AM
> To: Jon Rosen (jrosen) <jro...@cisco.com>
> Cc: David S. Miller <da...@davemloft.net>; Willem de Bruijn 
> <will...@google.com>; Eric Dumazet
> <eduma...@google.com>; Kees Cook <keesc...@chromium.org>; David Windsor 
> <dwind...@gmail.com>; Rosen,
> Rami <rami.ro...@intel.com>; Reshetova, Elena <elena.reshet...@intel.com>; 
> Mike Maloney
> <malo...@google.com>; Benjamin Poirier <bpoir...@suse.com>; Thomas Gleixner 
> <t...@linutronix.de>; Greg
> Kroah-Hartman <gre...@linuxfoundation.org>; open list:NETWORKING [GENERAL] 
> <netdev@vger.kernel.org>;
> open list <linux-ker...@vger.kernel.org>
> Subject: Re: [PATCH v2] packet: track ring entry use using a shadow ring to 
> prevent RX ring overrun
> 
> On Wed, May 23, 2018 at 7:54 AM, Jon Rosen (jrosen) <jro...@cisco.com> wrote:
> >> > For the ring, there is no requirement to allocate exactly the amount
> >> > specified by the user request. Safer than relying on shared memory
> >> > and simpler than the extra allocation in this patch would be to allocate
> >> > extra shadow memory at the end of the ring (and not mmap that).
> >> >
> >> > That still leaves an extra cold cacheline vs using tp_padding.
> >>
> >> Given my lack of experience and knowledge in writing kernel code
> >> it was easier for me to allocate the shadow ring as a separate
> >> structure.  Of course it's not about me and my skills so if it's
> >> more appropriate to allocate at the tail of the existing ring
> >> then certainly I can look at doing that.
> >
> > The memory for the ring is not one contiguous block, it's an array of
> > blocks of pages (or 'order' sized blocks of pages). I don't think
> > increasing the size of each of the blocks to provide storage would be
> > such a good idea as it will risk spilling over into the next order and
> > wasting lots of memory. I suspect it's also more complex than a single
> > shadow ring to do both the allocation and the access.
> >
> > It could be tacked onto the end of the pg_vec[] used to store the
> > pointers to the blocks. The challenge with that is that a pg_vec[] is
> > created for each of RX and TX rings so either it would have to
> > allocate unnecessary storage for TX or the caller will have to say if
> > extra space should be allocated or not.  E.g.:
> >
> > static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order, int scratch, void **scratch_p)
> >
> > I'm not sure avoiding the extra allocation and moving it to the
> > pg_vec[] for the RX ring is going to get the simplification you were
> > hoping for.  Is there another way of storing the shadow ring which
> > I should consider?
> 
> I did indeed mean attaching extra pages to pg_vec[]. It should be
> simpler than a separate structure, but I may be wrong.

I don't think it would be too bad; it may actually turn out to be
convenient to implement.

> 
> Either way, I still would prefer to avoid the shadow buffer completely.
> It incurs complexity and cycle cost on all users because of only the
> rare (non-existent?) consumer that overwrites the padding bytes.

I prefer that as well.  I'm just not sure there is a bulletproof
solution without the shadow state.  I also wish it were only a
theoretical issue but unfortunately it is actually something our
customers have seen.
> 
> Perhaps we can use padding yet avoid deadlock by writing a
> timed value. The simplest would be jiffies >> N. Then only a
> process that writes this exact value would be subject to drops and
> then still only for a limited period.
> 
> Instead of depending on wall clock time, like jiffies, another option
> would be to keep a percpu array of values. Each cpu has a zero
> entry if it is not writing, nonzero if it is. If a writer encounters a
> number in padding that is > num_cpus, then the state is garbage
> from userspace. If <= num_cpus, it is adhered to only until that cpu
> clears its entry, which is guaranteed to happen eventually.
> 
> Just a quick thought. This might not fly at all upon closer scrutiny.

I'm not sure I understand the suggestion, but I'll think on it
some more.
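
If I'm reading the suggestion right, it's roughly the following (a very
rough userspace-style sketch just to restate the idea; the names and the
atomics are stand-ins, not a proposed diff):

#include <stdatomic.h>
#include <stdbool.h>

#define NUM_CPUS 64

/* Nonzero while the corresponding cpu is between claiming a ring slot
 * and flipping its tp_status to TP_STATUS_USER.
 */
static atomic_int writer_busy[NUM_CPUS];

/* Value a producer would stash in the frame's padding while it owns the
 * slot: cpu id + 1, so 0 keeps its "nobody is writing" meaning.
 */
static inline unsigned int claim_token(int cpu)
{
        return cpu + 1;
}

/* Called by a producer that finds tp_status == TP_STATUS_KERNEL and wants
 * to know whether the padding value means "still being filled in".
 */
static bool slot_blocked(unsigned int padding_val)
{
        if (padding_val == 0)
                return false;           /* nobody claims the slot */
        if (padding_val > NUM_CPUS)
                return false;           /* garbage written by userspace */
        /* A real cpu id: honor the claim only while that cpu is still
         * busy, so a stale value can never wedge the ring permanently.
         */
        return atomic_load(&writer_busy[padding_val - 1]) != 0;
}

Is that roughly the gist?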

Some other options maybe worth considering (in no specific order):
- test the application to see whether it will consume entries when
  tp_status is set to anything other than TP_STATUS_USER; only use the
  shadow ring if it doesn't strictly honor the TP_STATUS_USER bit.

- skip the shadow ring if we see the new TP_STATUS_USER_TO_KERNEL value
  being used

- use tp_len == -1 to indicate that an entry is in use





RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-23 Thread Jon Rosen (jrosen)
> > For the ring, there is no requirement to allocate exactly the amount
> > specified by the user request. Safer than relying on shared memory
> > and simpler than the extra allocation in this patch would be to allocate
> > extra shadow memory at the end of the ring (and not mmap that).
> >
> > That still leaves an extra cold cacheline vs using tp_padding.
> 
> Given my lack of experience and knowledge in writing kernel code
> it was easier for me to allocate the shadow ring as a separate
> structure.  Of course it's not about me and my skills so if it's
> more appropriate to allocate at the tail of the existing ring
> then certainly I can look at doing that.

The memory for the ring is not one contiguous block, it's an array of
blocks of pages (or 'order' sized blocks of pages). I don't think
increasing the size of each of the blocks to provide storage would be
such a good idea as it will risk spilling over into the next order and
wasting lots of memory. I suspect it's also more complex than a single
shadow ring to do both the allocation and the access.

It could be tacked onto the end of the pg_vec[] used to store the
pointers to the blocks. The challenge with that is that a pg_vec[] is
created for each of RX and TX rings so either it would have to
allocate unnecessary storage for TX or the caller will have to say if
extra space should be allocated or not.  E.g.:

static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order, int scratch, void **scratch_p)

I'm not sure avoiding the extra allocation and moving it to the
pg_vec[] for the RX ring is going to get the simplification you were
hoping for.  Is there another way of storing the shadow ring which
I should consider?
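
For concreteness, the kind of change I was picturing looks roughly like
this (sketch only, with stand-in types and no error handling; the
'scratch' parameter name is made up):

#include <stdlib.h>

struct pgv {                     /* stand-in for the kernel's struct pgv */
        char *buffer;
};

struct tpacket_req {             /* only the field the sketch needs      */
        unsigned int tp_block_nr;
};

/* If 'scratch' is nonzero, allocate that many extra bytes after the block
 * pointer array and hand them back through *scratch_p, so the RX ring can
 * keep its shadow state there while the TX ring pays nothing extra.
 */
static struct pgv *alloc_pg_vec_sketch(struct tpacket_req *req,
                                       size_t scratch, void **scratch_p)
{
        size_t vec_sz = req->tp_block_nr * sizeof(struct pgv);
        struct pgv *pg_vec = calloc(1, vec_sz + scratch);

        if (!pg_vec)
                return NULL;

        /* ... per-block page allocations would happen here ... */

        if (scratch_p)
                *scratch_p = scratch ? (char *)pg_vec + vec_sz : NULL;
        return pg_vec;
}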


RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-23 Thread Jon Rosen (jrosen)
> >>> I think the bigger issues as you've pointed out are the cost of
> >>> the additional spin lock and should the additional state be
> >>> stored in-band (fewer cache lines) or out-of band (less risk of
> >>> breaking due to unpredictable application behavior).
> >>
> >> We don't need the spinlock if clearing the shadow byte after
> >> setting the status to user.
> >>
> >> Worst case, user will set it back to kernel while the shadow
> >> byte is not cleared yet and the next producer will drop a packet.
> >> But next producers will make progress, so there is no deadlock
> >> or corruption.
> >
> > I thought so too for a while but after spending more time than I
> > care to admit I realized the following sequence was occurring:
> >
> >    Core A                            Core B
> >    ------                            ------
> >    - Enter spin_lock
> >    -   Get tp_status of head (X)
> >        tp_status == 0
> >    -   Check inuse
> >        inuse == 0
> >    -   Allocate entry X
> >        advance head (X+1)
> >        set inuse=1
> >    - Exit spin_lock
> >
> >    <Core A is delayed or preempted at this point>
> >
> >    <other kernel threads allocate N-1 more entries, wrapping head
> >     back around to slot X, where N = size of ring>
> >
> >                                       - Enter spin_lock
> >                                       -   get tp_status of head (X+N)
> >                                           tp_status == 0 (but slot
> >                                           in use for X on core A)
> >
> >    - write tp_status of  <--- trouble!
> >      X = TP_STATUS_USER  <--- trouble!
> >    - write inuse=0       <--- trouble!
> >
> >                                       -   Check inuse
> >                                           inuse == 0
> >                                       -   Allocate entry X+N
> >                                           advance head (X+N+1)
> >                                           set inuse=1
> >                                       - Exit spin_lock
> >
> >
> > At this point Core A just passed slot X to userspace with a
> > packet and Core B has just been assigned slot X+N (same slot as
> > X) for its new packet. Both cores A and B end up filling in that
> > slot.  Tracking this down was one of the reasons it took me a
> > while to produce these updated diffs.
> 
> Is this not just an ordering issue? Since inuse is set after tp_status,
> it has to be tested first (and barriers are needed to avoid reordering).

I changed the code as you suggested to do the inuse check first and
removed the extra spin_lock/unlock, and it seems to be working.
I was able to run through the night without an issue (normally I would
hit the ring corruption in 1 to 2 hours).
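
For reference, a standalone model of the ordering I ended up with
(illustration only, using stand-in atomics rather than the kernel
helpers; not the actual diff):

#include <stdatomic.h>
#include <stdbool.h>

#define TP_STATUS_KERNEL 0u
#define TP_STATUS_USER   (1u << 0)

struct slot {
        atomic_uint tp_status;   /* shared with userspace              */
        atomic_int  inuse;       /* kernel-only shadow state, per slot */
};

/* Allocation side (under the receive queue lock): test the shadow flag
 * first, and only then trust tp_status.
 */
static bool try_claim(struct slot *s)
{
        if (atomic_load_explicit(&s->inuse, memory_order_acquire))
                return false;            /* still being filled in      */
        if (atomic_load_explicit(&s->tp_status, memory_order_acquire)
            != TP_STATUS_KERNEL)
                return false;            /* owned by userspace         */
        atomic_store_explicit(&s->inuse, 1, memory_order_relaxed);
        return true;
}

/* Completion side (no extra spin lock): publish to userspace first,
 * then clear the shadow flag.
 */
static void complete(struct slot *s)
{
        atomic_store_explicit(&s->tp_status, TP_STATUS_USER,
                              memory_order_release);
        atomic_store_explicit(&s->inuse, 0, memory_order_release);
}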

Thanks for pointing that out, I should have caught that myself.  Next
I'll look at your suggestion for where to put the shadow ring.


RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-22 Thread Jon Rosen (jrosen)
On Monday, May 21, 2018 2:17 PM, Jon Rosen (jrosen) <jro...@cisco.com> wrote:
> On Monday, May 21, 2018 1:07 PM, Willem de Bruijn
> <willemdebruijn.ker...@gmail.com> wrote:
>> On Mon, May 21, 2018 at 8:57 AM, Jon Rosen (jrosen) <jro...@cisco.com> wrote:

...snip...

>>
>> A setsockopt for userspace to signal a stricter interpretation of
>> tp_status to elide the shadow hack could then be considered.
>> It's not pretty. Either way, no full new version is required.
>>
>>> As much as I would like to find a solution that doesn't require
>>> the spin lock I have yet to do so. Maybe the answer is that
>>> existing applications will need to suffer the performance impact
>>> but a new version or option for TPACKET_V1/V2 could be added to
>>> indicate strict adherence of the TP_STATUS_USER bit and then the
>>> original diffs could be used.

It looks like adding new socket options is pretty rare so I
wonder if a better option might be to define a new TP_STATUS_XXX
bit which would signal from a userspace application to the kernel
that it strictly interprets the TP_STATUS_USER bit to determine
ownership.

Today's applications set tp_status = TP_STATUS_KERNEL(0) for the
kernel to pick up the entry.  We could define a new value to pass
ownership as well as one to indicate to other kernel threads that
an entry is in use:

#define TP_STATUS_USER_TO_KERNEL (1 << 8)
#define TP_STATUS_INUSE (1 << 9)

If the kernel sees tp_status == TP_STATUS_KERNEL then it should
use the shadow method for tracking ownership. If it sees tp_status
== TP_STATUS_USER_TO_KERNEL then it can use the TP_STATUS_INUSE
method.
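
In code the per-entry dispatch could be as small as something like this
(sketch only; the two new values are the proposed ones above, everything
else is made up for illustration):

#define TP_STATUS_KERNEL          0u
#define TP_STATUS_USER_TO_KERNEL  (1u << 8)   /* proposed: strict handoff */
#define TP_STATUS_INUSE           (1u << 9)   /* proposed: in-band "busy" */

enum claim_method {
        CLAIM_NONE,     /* entry not free: owned by userspace or in use   */
        CLAIM_SHADOW,   /* legacy app: protect via the kernel shadow ring */
        CLAIM_INBAND,   /* opted-in app: mark TP_STATUS_INUSE in the slot */
};

static enum claim_method claim_method_for(unsigned int tp_status)
{
        if (tp_status == TP_STATUS_USER_TO_KERNEL)
                return CLAIM_INBAND;
        if (tp_status == TP_STATUS_KERNEL)
                return CLAIM_SHADOW;
        return CLAIM_NONE;
}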

>>>
>>> There is another option I was considering but have yet to try
>>> which would avoid needing a shadow ring by using counter(s) to
>>> track maximum sequence number queued to userspace vs. the next
>>> sequence number to be allocated in the ring.  If the difference
>>> is greater than the size of the ring then the ring can be
>>> considered full and the allocation would fail. Of course this may
>>> create an additional hotspot between cores, not sure if that
>>> would be significant or not.
>>
>> Please do have a look, but I don't think that this will work in this
>> case in practice. It requires tracking the producer tail. Updating
>> the slowest writer requires probing each subsequent slot's status
>> byte to find the new tail, which is a lot of (by then cold) cacheline
>> reads.
>
> I've thought about it a little more and am not convinced it's
> workable but I'll spend a little more time on it before giving
> up.

I've given up on this method.  Just don't see how to make it work.



RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-21 Thread Jon Rosen (jrosen)
On Monday, May 21, 2018 1:07 PM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
>On Mon, May 21, 2018 at 8:57 AM, Jon Rosen (jrosen) <jro...@cisco.com> wrote:
>> On Sunday, May 20, 2018 7:22 PM, Willem de Bruijn
>> <willemdebruijn.ker...@gmail.com> wrote:
>>> On Sun, May 20, 2018 at 6:51 PM, Willem de Bruijn
>>> <willemdebruijn.ker...@gmail.com> wrote:
>>>> On Sat, May 19, 2018 at 8:07 AM, Jon Rosen <jro...@cisco.com> wrote:
>>>>> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
>>>>> causes the ring to get corrupted by allowing multiple kernel threads
>>>>> to claim ownership of the same ring entry. Track ownership in a shadow
>>>>> ring structure to prevent other kernel threads from reusing the same
>>>>> entry before it's fully filled in, passed to user space, and then
>>>>> eventually passed back to the kernel for use with a new packet.
>>>>>
>>>>> Signed-off-by: Jon Rosen <jro...@cisco.com>
>>>>> ---
>>>>>
>>>>> There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
>>>>> the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2.  This bug makes
>>>>> it possible for multiple kernel threads to claim ownership of the same
>>>>> ring entry, corrupting the ring and the corresponding packet(s).
>>>>>
>>>>> These diffs are the second proposed solution, previous proposal was 
>>>>> described
>>>>> in https://www.mail-archive.com/netdev@vger.kernel.org/msg227468.html
>>>>> subject [RFC PATCH] packet: mark ring entry as in-use inside spin_lock
>>>>> to prevent RX ring overrun
>>>>>
>>>>> Those diffs would have changed the binary interface and have broken 
>>>>> certain
>>>>> applications. Consensus was that such a change would be inappropriate.
>>>>>
>>>>> These new diffs use a shadow ring in kernel space for tracking 
>>>>> intermediate
>>>>> state of an entry and prevent more than one kernel thread from 
>>>>> simultaneously
>>>>> allocating a ring entry. This avoids any impact to the binary interface
>>>>> between kernel and userspace but comes at the additional cost of 
>>>>> requiring a
>>>>> second spin_lock when passing ownership of a ring entry to userspace.
>>>>>
>>>>> Jon Rosen (1):
>>>>>   packet: track ring entry use using a shadow ring to prevent RX ring
>>>>> overrun
>>>>>
>>>>>  net/packet/af_packet.c | 64 
>>>>> ++
>>>>>  net/packet/internal.h  | 14 +++
>>>>>  2 files changed, 78 insertions(+)
>>>>>
>>>>
>>>>> @@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
>>>>> net_device *dev,
>>>>>  #endif
>>>>>
>>>>> if (po->tp_version <= TPACKET_V2) {
>>>>> +   spin_lock(&sk->sk_receive_queue.lock);
>>>>> __packet_set_status(po, h.raw, status);
>>>>> +   packet_rx_shadow_release(rx_shadow_ring_entry);
>>>>> +   spin_unlock(&sk->sk_receive_queue.lock);
>>>>> +
>>>>> sk->sk_data_ready(sk);
>>>>
>>>> Thanks for continuing to look at this. I spent some time on it last time
>>>> around but got stuck, too.
>>>>
>>>> This version takes an extra spinlock in the hot path. That will be very
>>>> expensive. Once we need to accept that, we could opt for a simpler
>>>> implementation akin to the one discussed in the previous thread:
>>>>
>>>> stash a value in tp_padding or similar while tp_status remains
>>>> TP_STATUS_KERNEL to signal ownership to concurrent kernel
>>>> threads. The issue previously was that that field could not atomically
>>>> be cleared together with __packet_set_status. This is no longer
>>>> an issue when holding the queue lock.
>>>>
>>>> With a field like tp_padding, unlike tp_len, it is arguably also safe to
>>>> clear it after flipping status (userspace should treat it as undefined).
>>>>
>>>> With v1 tpacket_hdr, no explicit padding field is defined but due to
>>>> TPACKET_HDRLEN alignment it exists on both 32 and 64 bit platforms.

RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-21 Thread Jon Rosen (jrosen)
On Sunday, May 20, 2018 7:22 PM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Sun, May 20, 2018 at 6:51 PM, Willem de Bruijn
> <willemdebruijn.ker...@gmail.com> wrote:
>> On Sat, May 19, 2018 at 8:07 AM, Jon Rosen <jro...@cisco.com> wrote:
>>> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
>>> causes the ring to get corrupted by allowing multiple kernel threads
>>> to claim ownership of the same ring entry. Track ownership in a shadow
>>> ring structure to prevent other kernel threads from reusing the same
>>> entry before it's fully filled in, passed to user space, and then
>>> eventually passed back to the kernel for use with a new packet.
>>>
>>> Signed-off-by: Jon Rosen <jro...@cisco.com>
>>> ---
>>>
>>> There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
>>> the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2.  This bug makes
>>> it possible for multiple kernel threads to claim ownership of the same
>>> ring entry, corrupting the ring and the corresponding packet(s).
>>>
>>> These diffs are the second proposed solution, previous proposal was 
>>> described
>>> in https://www.mail-archive.com/netdev@vger.kernel.org/msg227468.html
>>> subject [RFC PATCH] packet: mark ring entry as in-use inside spin_lock
>>> to prevent RX ring overrun
>>>
>>> Those diffs would have changed the binary interface and have broken certain
>>> applications. Consensus was that such a change would be inappropriate.
>>>
>>> These new diffs use a shadow ring in kernel space for tracking intermediate
>>> state of an entry and prevent more than one kernel thread from 
>>> simultaneously
>>> allocating a ring entry. This avoids any impact to the binary interface
>>> between kernel and userspace but comes at the additional cost of requiring a
>>> second spin_lock when passing ownership of a ring entry to userspace.
>>>
>>> Jon Rosen (1):
>>>   packet: track ring entry use using a shadow ring to prevent RX ring
>>> overrun
>>>
>>>  net/packet/af_packet.c | 64 
>>> ++
>>>  net/packet/internal.h  | 14 +++
>>>  2 files changed, 78 insertions(+)
>>>
>>
>>> @@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
>>> net_device *dev,
>>>  #endif
>>>
>>> if (po->tp_version <= TPACKET_V2) {
>>> +   spin_lock(&sk->sk_receive_queue.lock);
>>> __packet_set_status(po, h.raw, status);
>>> +   packet_rx_shadow_release(rx_shadow_ring_entry);
>>> +   spin_unlock(&sk->sk_receive_queue.lock);
>>> +
>>> sk->sk_data_ready(sk);
>>
>> Thanks for continuing to look at this. I spent some time on it last time
>> around but got stuck, too.
>>
>> This version takes an extra spinlock in the hot path. That will be very
>> expensive. Once we need to accept that, we could opt for a simpler
>> implementation akin to the one discussed in the previous thread:
>>
>> stash a value in tp_padding or similar while tp_status remains
>> TP_STATUS_KERNEL to signal ownership to concurrent kernel
>> threads. The issue previously was that that field could not atomically
>> be cleared together with __packet_set_status. This is no longer
>> an issue when holding the queue lock.
>>
>> With a field like tp_padding, unlike tp_len, it is arguably also safe to
>> clear it after flipping status (userspace should treat it as undefined).
>>
>> With v1 tpacket_hdr, no explicit padding field is defined but due to
>> TPACKET_HDRLEN alignment it exists on both 32 and 64 bit
>> platforms.
>>
>> The danger with using padding is that a process may write to it
>> and cause deadlock, of course. There is no logical reason for doing
>> so.
>
> For the ring, there is no requirement to allocate exactly the amount
> specified by the user request. Safer than relying on shared memory
> and simpler than the extra allocation in this patch would be to allocate
> extra shadow memory at the end of the ring (and not mmap that).
>
> That still leaves an extra cold cacheline vs using tp_padding.

Given my lack of experience and knowledge in writing kernel code
it was easier for me to allocate the shadow ring as a separate
structure.  Of course it's not about me and my skills so if it's
more appropriate to allocate at the tail of the existing ring
then certainly I can look at doing that.

RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-05-19 Thread Jon Rosen (jrosen)
Forward link to a new proposed patch at:
https://www.mail-archive.com/netdev@vger.kernel.org/msg236629.html



[PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-19 Thread Jon Rosen
Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
causes the ring to get corrupted by allowing multiple kernel threads
to claim ownership of the same ring entry. Track ownership in a shadow
ring structure to prevent other kernel threads from reusing the same
entry before it's fully filled in, passed to user space, and then
eventually passed back to the kernel for use with a new packet.

Signed-off-by: Jon Rosen <jro...@cisco.com>
---

There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2.  This bug makes
it possible for multiple kernel threads to claim ownership of the same
ring entry, corrupting the ring and the corresponding packet(s).

These diffs are the second proposed solution, previous proposal was described
in https://www.mail-archive.com/netdev@vger.kernel.org/msg227468.html
subject [RFC PATCH] packet: mark ring entry as in-use inside spin_lock
to prevent RX ring overrun

Those diffs would have changed the binary interface and have broken certain
applications. Consensus was that such a change would be inappropriate.

These new diffs use a shadow ring in kernel space for tracking intermediate
state of an entry and prevent more than one kernel thread from simultaneously
allocating a ring entry. This avoids any impact to the binary interface
between kernel and userspace but comes at the additional cost of requiring a
second spin_lock when passing ownership of a ring entry to userspace.

Jon Rosen (1):
  packet: track ring entry use using a shadow ring to prevent RX ring
overrun

 net/packet/af_packet.c | 64 ++
 net/packet/internal.h  | 14 +++
 2 files changed, 78 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index e0f3f4a..4d08c8e 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2165,6 +2165,26 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
return 0;
 }
 
+static inline void *packet_rx_shadow_acquire_head(struct packet_sock *po)
+{
+   struct packet_ring_shadow_entry *entry;
+
+   entry = &po->rx_shadow.ring[po->rx_ring.head];
+   if (unlikely(entry->inuse))
+       return NULL;
+
+   entry->inuse = 1;
+   return (void *)entry;
+}
+
+static inline void packet_rx_shadow_release(void *_entry)
+{
+   struct packet_ring_shadow_entry *entry;
+
+   entry = (struct packet_ring_shadow_entry *)_entry;
+   entry->inuse = 0;
+}
+
 static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
   struct packet_type *pt, struct net_device *orig_dev)
 {
@@ -2182,6 +2202,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
__u32 ts_status;
bool is_drop_n_account = false;
bool do_vnet = false;
+   void *rx_shadow_ring_entry = NULL;
 
/* struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT.
 * We may add members to them until current aligned size without forcing
@@ -2277,7 +2298,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
if (!h.raw)
goto drop_n_account;
if (po->tp_version <= TPACKET_V2) {
+   /* Attempt to allocate shadow ring entry.
+    * If already inuse then the ring is full.
+    */
+   rx_shadow_ring_entry = packet_rx_shadow_acquire_head(po);
+   if (unlikely(!rx_shadow_ring_entry))
+       goto ring_is_full;
+
    packet_increment_rx_head(po, &po->rx_ring);
+
/*
 * LOSING will be reported till you read the stats,
 * because it's COR - Clear On Read.
@@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 #endif
 
if (po->tp_version <= TPACKET_V2) {
+   spin_lock(&sk->sk_receive_queue.lock);
__packet_set_status(po, h.raw, status);
+   packet_rx_shadow_release(rx_shadow_ring_entry);
+   spin_unlock(&sk->sk_receive_queue.lock);
+
sk->sk_data_ready(sk);
} else {
prb_clear_blk_fill_status(&po->rx_ring);
@@ -4197,6 +4230,25 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
goto out;
 }
 
+static struct packet_ring_shadow_entry *
+   packet_rx_shadow_alloc(unsigned int tp_frame_nr)
+{
+   struct packet_ring_shadow_entry *rx_shadow_ring;
+   int ring_size;
+   int i;
+
+   ring_size = tp_frame_nr * sizeof(*rx_shadow_ring);
+   rx_shadow_ring = kmalloc(ring_size, GFP_KERNEL);
+
+   if (!rx_shadow_ring)
+       return NULL;
+
+   for (i = 0; i < tp_frame_nr; i++)
+       rx_shadow_ring[i].inuse = 0;
+
+   return rx_shadow_ring;
+}
+
 static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,

RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-04 Thread Jon Rosen (jrosen)
> >> >One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
> >> >is that the documentation of the tp_status field is somewhat
> >> >inconsistent.  In some places it's described as TP_STATUS_KERNEL(0)
> >> >meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
> >> >meaning the entry is owned by user space.  In other places ownership
> >> >by user space is defined by the TP_STATUS_USER(1) bit being set.
> >>
> >> But indeed this example in packet_mmap.txt is problematic
> >>
> >> if (status == TP_STATUS_KERNEL)
> >> retval = poll(&pfd, 1, timeout);
> >>
> >> It does not really matter whether the docs are possibly inconsistent and
> >> which one is authoritative. Examples like the above make it likely that
> >> some user code expects such code to work.
> >
> > Yes, that's exactly my concern.  Yet another troubling example seems to be
> > libpcap which is also looking specifically for status to be anything other 
> > than
> > TP_STATUS_KERNEL(0) to indicate a frame is available in user space.
> 
> Good catch. If pcap-linux.c relies on this then the status field
> cannot be changed. Other fields can be modified freely while tp_status
> remains 0, perhaps that's an option.

Possibly. Someone else suggested something similar but in at least the
one example we thought through it still seemed like it didn't address the 
problem.

For example, let's say we used tp_len == -1 to indicate to other kernel threads
that the entry was already in progress.  This would require that user space 
never
set tp_len = -1 before returning the entry back to the kernel.  If it did then 
no
kernel thread would ever claim ownership and the ring would hang.

Now, it seems pretty unlikely that user space would do such a thing so maybe we
could look past that, but then we run into the issue that there is still a 
window
of opportunity for other kernel threads to come in and wrap the ring.

The reason is we can't set tp_len to the correct length after setting tp_status,
because user space could grab the entry and see tp_len == -1, so we have to set
tp_len before we set tp_status. This means that there is still a window where
other kernel threads could come in, see tp_len as something other than -1 and
a tp_status of TP_STATUS_KERNEL, and think it's ok to allocate the entry.
This puts us back to where we are today (arguably with a smaller window,
but a window nonetheless).
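
To spell out that window, here is a minimal model of the ordering (names
and the atomics are illustrative, not real af_packet code):

#include <stdatomic.h>
#include <stdbool.h>

#define TP_STATUS_KERNEL 0u
#define TP_STATUS_USER   (1u << 0)
#define LEN_INUSE        (-1)

struct frame {
        atomic_int  tp_len;      /* -1 would mark "claimed by a kernel thread" */
        atomic_uint tp_status;
};

/* A producer finishing a packet: tp_len must be written before tp_status,
 * otherwise userspace could observe tp_len == -1 on a TP_STATUS_USER frame.
 */
static void publish(struct frame *f, int len)
{
        atomic_store(&f->tp_len, len);               /* window opens here... */
        atomic_store(&f->tp_status, TP_STATUS_USER); /* ...and closes here   */
}

/* Another kernel producer probing the same slot inside that window sees
 * tp_len != -1 and tp_status == TP_STATUS_KERNEL and wrongly claims it.
 */
static bool looks_free(struct frame *f)
{
        return atomic_load(&f->tp_len) != LEN_INUSE &&
               atomic_load(&f->tp_status) == TP_STATUS_KERNEL;
}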

Alternatively we could reacquire the spin_lock to then set tp_len followed by
tp_status.  This would give the necessary indivisibility in the kernel while 
preserving proper order as made visible to user space, but it comes at the cost
of another spin_lock.

Thanks for the suggestion.  If you can think of a way around this I'm all ears.
I'll think on this some more but so far I'm stuck on how to get past having to
broaden the scope of the spin_lock, reacquire the spin_lock, or use some sort
of atomic construct along with a parallel shadow ring structure (still thinking
through that one as well).



RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-04 Thread Jon Rosen (jrosen)
On Wednesday, April 04, 2018 9:49 AM, Willem de Bruijn <will...@google.com> 
wrote:
> 
> On Tue, Apr 3, 2018 at 11:55 PM, Jon Rosen <jro...@cisco.com> wrote:
> > Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
> > causes the ring to get corrupted by allowing multiple kernel threads
> > to claim ownership of the same ring entry. Mark the ring entry as
> > already being used within the spin_lock to prevent other kernel
> > threads from reusing the same entry before it's fully filled in,
> > passed to user space, and then eventually passed back to the kernel
> > for use with a new packet.
> >
> > Note that the proposed change may modify the semantics of the
> > interface between kernel space and user space in a way which may cause
> > some applications to no longer work properly.
> 
> As long as TP_STATUS_USER (1) is not set, userspace should ignore
> the slot.
> 
> >One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
> >is that the documentation of the tp_status field is somewhat
> >inconsistent.  In some places it's described as TP_STATUS_KERNEL(0)
> >meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
> >meaning the entry is owned by user space.  In other places ownership
> >by user space is defined by the TP_STATUS_USER(1) bit being set.
> 
> But indeed this example in packet_mmap.txt is problematic
> 
> if (status == TP_STATUS_KERNEL)
> retval = poll(&pfd, 1, timeout);
> 
> It does not really matter whether the docs are possibly inconsistent and
> which one is authoritative. Examples like the above make it likely that
> some user code expects such code to work.

Yes, that's exactly my concern.  Yet another troubling example seems to be
libpcap which is also looking specifically for status to be anything other than
TP_STATUS_KERNEL(0) to indicate a frame is available in user space.

Either way things are broken. They are broken as they stand now because the
ring can get overrun and the kernel and user space tracking of the ring can
get out of sync.  And they are broken with the below change because some user
space applications will be looking for anything other than TP_STATUS_KERNEL,
so again the ring will get out of sync.

The difference is that the way it is today is, on average (across all
environments and across all user space apps), less likely to occur, while with
the change below it is much more likely to occur.

Maybe the right answer here is to implement a fix that is compatible with 
existing
applications and accept any potential performance impacts and then add yet 
another
version (TPACKET_V4?) which more strictly requires the TP_STATUS_USER bit for
passing ownership.
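
For comparison, a strict consumer under such a scheme would wait on the
TP_STATUS_USER bit itself rather than on tp_status merely being nonzero
(userspace-side sketch, stand-in types only):

#include <poll.h>
#include <stdint.h>

#define TP_STATUS_USER (1 << 0)

struct tpacket_hdr_view {                /* stand-in for struct tpacket_hdr */
        volatile uint32_t tp_status;
};

static void wait_for_frame(struct tpacket_hdr_view *hdr, int sock_fd)
{
        struct pollfd pfd = { .fd = sock_fd, .events = POLLIN };

        /* Anything without TP_STATUS_USER set still belongs to the kernel,
         * even if intermediate status bits happen to be set.
         */
        while (!(hdr->tp_status & TP_STATUS_USER))
                poll(&pfd, 1, -1);
}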

> 
> > +++ b/net/packet/af_packet.c
> > @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
> > net_device *dev,
> > if (po->stats.stats1.tp_drops)
> > status |= TP_STATUS_LOSING;
> > }
> > +
> > +/*
> > + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other
> > + * kernel threads from re-using this same entry.
> > + */
> > +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING
> 
> No need to reinterpret existing flags. tp_status is a u32 with
> sufficient undefined bits.

Agreed.

> 
> > +   if (po->tp_version <= TPACKET_V2)
> > +__packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> > +
> > po->stats.stats1.tp_packets++;
> > if (copy_skb) {
> > status |= TP_STATUS_COPY;
> > --
> > 2.10.3.dirty
> >

Thanks for the feedback!
Jon.


RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-04 Thread Jon Rosen (jrosen)


> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index e0f3f4a..264d7b2 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
> > net_device *dev,
> > if (po->stats.stats1.tp_drops)
> > status |= TP_STATUS_LOSING;
> > }
> > +
> > +/*
> > + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other
> > + * kernel threads from re-using this same entry.
> > + */
> > +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING
> > +   if (po->tp_version <= TPACKET_V2)
> > +__packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> > +
> > po->stats.stats1.tp_packets++;
> > if (copy_skb) {
> > status |= TP_STATUS_COPY;
> 
> This patch looks correct. Please resend it with proper signed-off-by
> and with a kernel code indenting style (tabs).  Is this bug present
> since the beginning of af_packet and multiqueue devices or did it get
> introduced in some previous kernel?

Sorry about the tabs, I'll fix that and try to figure out what I did wrong with
the signed-off-by.

I've looked back as far as I could find online (2.6.11) and it would appear that
this bug has always been there.

Thanks, jon.



[RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-03 Thread Jon Rosen
Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
causes the ring to get corrupted by allowing multiple kernel threads
to claim ownership of the same ring entry. Mark the ring entry as
already being used within the spin_lock to prevent other kernel
threads from reusing the same entry before it's fully filled in,
passed to user space, and then eventually passed back to the kernel
for use with a new packet.

Note that the proposed change may modify the semantics of the
interface between kernel space and user space in a way which may cause
some applications to no longer work properly. More discussion on this
change can be found in the additional comments section titled
"3. Discussion on packet_mmap ownership semantics:".

Signed-off-by: Jon Rosen <jro...@cisco.com>
---

Additional Comments Section
---

1. Description of the diffs:


   TPACKET_V1 and TPACKET_V2 format rings:
   ---
   Mark each entry as TP_STATUS_IN_PROGRESS after allocating to
   prevent other kernel threads from re-using the same entry.

   This is necessary because there may be a delay from the time the
   spin_lock is released to the time that the packet is completed and
   the corresponding ring entry is marked as owned by user space.  If
   during this time other kernel threads enqueue more packets to the
   ring than the size of the ring then it will cause multiple kernel
   threads to operate on the same entry at the same time, corrupting
   packets and the ring state.

   By marking the entry as allocated (IN_PROGRESS) we prevent other
   kernel threads from incorrectly re-using an entry that is still in
   the progress of being filled in before it is passed to user space.

   This forces each entry through the following states:

   +-> 1. (tp_status == TP_STATUS_KERNEL)
   |  Free: For use by any kernel thread to store a new packet
   |
   |   2. !(tp_status == TP_STATUS_KERNEL) && !(tp_status & TP_STATUS_USER)
   |  Allocated: In use by a *specific* kernel thread
   |
   |   3. (tp_status & TP_STATUS_USER)
   |  Available: Packet available for user space to process
   |
   +-- Loop back to #1 when user space writes entry as TP_STATUS_KERNEL
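
   Expressed as predicates, the three states might look like this
   (illustrative only; the in-progress value reuses the TP_STATUS_LOSING
   bit as described below):

   #define TP_STATUS_KERNEL       0u
   #define TP_STATUS_USER         (1u << 0)
   #define TP_STATUS_IN_PROGRESS  (1u << 2)   /* reuses the LOSING bit */

   static inline int frame_free(unsigned int s)        /* state 1 */
   {
           return s == TP_STATUS_KERNEL;
   }

   static inline int frame_allocated(unsigned int s)   /* state 2 */
   {
           return s != TP_STATUS_KERNEL && !(s & TP_STATUS_USER);
   }

   static inline int frame_available(unsigned int s)   /* state 3 */
   {
           return (s & TP_STATUS_USER) != 0;
   }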


   No impact on TPACKET_V3 format rings:
   -
   Packet entry ownership is already protected from other kernel
   threads potentially re-using the same entry. This is done inside
   packet_current_rx_frame() where storage is allocated for the
   current packet. Since this is done within the spin_lock no
   additional status updates for this entry are required.


   Defining TP_STATUS_IN_PROGRESS:
   ---
   Rather than defining a new bit we re-use an existing bit for this
   intermediate state.  Status will eventually be overwritten with the
   actual true status when passed to user space.  Any bit used to pass
   information to user space other than the one that passes ownership
   is suitable (can't use TP_STATUS_USER).  Alternatively a new bit
   could be defined.


2. More detailed discussion:

   Ring entries basically have 2 states, owned by the kernel or owned by
   user space. For single producer/single consumer this works fine. For
   multiple producers there is a window between the call to spin_unlock
   [F] and the call to __packet_set_status [J] where if there are enough
   packets added to the ring by other kernel threads then the ring can
   wrap and multiple threads will end up using the same ring entries.

   This occurs because the ring entry allocated at [C] did not modify the
   state of the entry so it continues to appear as owned by the kernel
   and available for use for new packets even though it has already been
   allocated.

   A simple fix is to temporarily mark the ring entries within the spin
   lock such that user space will still think it's owned by the kernel
   and other kernel threads will not see it as available to be used for
   new packets. If a kernel thread gets delayed between [F] and [J] for
   an extended period of time and the ring wraps back to the same point
   then subsequent kernel threads' attempts to allocate will fail and be
   treated as the ring being full.

   The change below at [D] uses a newly defined TP_STATUS_IN_PROGRESS bit
   to prevent other kernel threads from re-using the same entry. Note that
   any existing bit other than TP_STATUS_USER could have been used.

   af_packet.c:tpacket_rcv()
      ... code removed for brevity ...

      // Acquire spin lock
   A: spin_lock(&sk->sk_receive_queue.lock);

      // Preemption is disabled

      // Get current ring entry
   B: h.raw = packet_current_rx_frame(
          po, skb, TP_STATUS_KERNEL, (macoff+snaplen));

      // Get out if ring is full
      // Code not shown but it will also release the lock