Re: [ovs-discuss] Patch for PACKET_OUT getting deleted twice crash

2021-06-15 Thread Ilya Maximets
On 6/15/21 4:31 AM, Tony van der Peet wrote:
> I did the patch as suggested, and it looks something like this:
> 
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -4168,9 +4168,13 @@dpif_netdev_execute(struct dpif *dpif, struct 
> dpif_execute *execute)
>    flow_hash_5tuple(execute->flow, 0));
> }
>  
> +    /* We are not losing the original packet, just relying on its previous
> + * owner to properly dispose of it.
> + */
> +    execute->packet = dp_packet_clone(execute->packet);

You're leaking the packet here and destroying the copy later, because
dp_netdev_execute_actions() steals it, so no packet left at all.
'execute->packet' should stay as is, cloned packet should be added to
the 'pp' batch.  Something like:

dp_packet_batch_init_packet(, dp_packet_clone(execute->packet));


> dp_packet_batch_init_packet(, execute->packet);
> -    pp.do_not_steal = true;
> -    dp_netdev_execute_actions(pmd, , false, execute->flow,
> +    pp.do_not_steal = false;
> +    dp_netdev_execute_actions(pmd, , true, execute->flow,
>   execute->actions, execute->actions_len);
> dp_netdev_pmd_flush_output_packets(pmd, true);
> 
> Unfortunately, I now get some errors in the "make check", including:
> 
> 1012: PMD - monitor threads   FAILED (pmd.at:657 
> )
> 1018: ofproto-dpif - active-backup bonding (with primary) FAILED 
> (ofproto-dpif.at:70 )
> 1020: ofproto-dpif - active-backup bonding (without primary) FAILED 
> (ofproto-dpif.at:288 )
> 1184: ofproto-dpif megaflow - normal, balance-tcp bonding FAILED 
> (ofproto-dpif.at:8664 )
> 
> My earlier analysis of the code showed that a call tree starting from 
> handle_upcalls can also result in
> dpif_netdev_execute getting called - I wonder if that's what's happening 
> here? I'll keep analysing these
> new failures, but any pointers would be appreciated.
> 
> Tony
> 
> On Tue, Jun 15, 2021 at 12:28 PM Tony van der Peet  > wrote:
> 
> Yes, I can create that patch.
> 
> Tony
> 
> On Sat, Jun 12, 2021 at 5:24 AM Ilya Maximets  > wrote:
> 
> On 6/9/21 3:12 PM, Aaron Conole wrote:
> > Ilya Maximets mailto:i.maxim...@ovn.org>> 
> writes:
> >
> >> On 6/7/21 3:59 PM, Ilya Maximets wrote:
> >>> On 6/7/21 3:09 PM, Aaron Conole wrote:
>  Ilya Maximets mailto:i.maxim...@ovn.org>> 
> writes:
> 
> >> Here is a patch with both a test and a fix.
> 
>  Thanks so much!  It's nice to get fixes, but I think it's really 
> great
>  when test cases come along with them.
> 
> > Hi.  Thanks for working n this!
> >
> > CC: ovs-dev
> >
> >> Not submitting as a formal
> >> patch because I would like some feedback on whether 1) 
> maintainers feel
> >> this is worth fixing and
> >
> > I can reproduce the crash with your test.  Basically, actions 
> in userspace
> > datapath may drop packets if something goes wrong.  'meter' 
> action just
> > seems to be the most explicit variant.  So, I think, this is 
> definitely
> > worth fixing as some other condition might trigger this crash 
> on packet-out
> > as well.
> >
> > ==2568112==ERROR: AddressSanitizer: heap-use-after-free
> > on address 0x6160699c at pc 0x00573860 bp 
> 0x7ffebc6cc880 sp 0x7ffebc6cc878
> > READ of size 1 at 0x6160699c thread T0
> >     #0 0x57385f in dp_packet_delete lib/dp-packet.h:242:16
> >     #1 0x57372c in ofproto_packet_out_uninit 
> ofproto/ofproto.c:3562:5
> >     #2 0x585e77 in handle_packet_out ofproto/ofproto.c:3722:5
> >     #3 0x583801 in handle_single_part_openflow 
> ofproto/ofproto.c:8499:16
> >     #4 0x570c9c in handle_openflow ofproto/ofproto.c:8686:21
> >     #5 0x611781 in ofconn_run ofproto/connmgr.c:1329:13
> >     #6 0x6112ed in connmgr_run ofproto/connmgr.c:356:9
> >     #7 0x56fdf4 in ofproto_run ofproto/ofproto.c:1891:5
> >     #8 0x545ec0 in bridge_run__ vswitchd/bridge.c:3251:9
> >     #9 0x5456a5 in bridge_run vswitchd/bridge.c:3310:5
> >     #10 0x55f5b1 in main vswitchd/ovs-vswitchd.c:127:9
> >     #11 0x7f85bfe09081 in __libc_start_main 
> (/lib64/libc.so.6+0x27081)
> >     #12 0x46d00d in _start (vswitchd/ovs-vswitchd+0x46d00d)
> >
> >> 2) whether this is the way to fix it.
> >
> > See inline.
> >
> >>
> >> I have tried to make the most minimal 

Re: [ovs-discuss] Patch for PACKET_OUT getting deleted twice crash

2021-06-14 Thread Tony van der Peet
I did the patch as suggested, and it looks something like this:

--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -4168,9 +4168,13 @@ dpif_netdev_execute(struct dpif *dpif, struct
dpif_execute *execute)
   flow_hash_5tuple(execute->flow, 0));
}

+/* We are not losing the original packet, just relying on its previous
+ * owner to properly dispose of it.
+ */
+execute->packet = dp_packet_clone(execute->packet);
dp_packet_batch_init_packet(, execute->packet);
-pp.do_not_steal = true;
-dp_netdev_execute_actions(pmd, , false, execute->flow,
+pp.do_not_steal = false;
+dp_netdev_execute_actions(pmd, , true, execute->flow,
  execute->actions, execute->actions_len);
dp_netdev_pmd_flush_output_packets(pmd, true);

Unfortunately, I now get some errors in the "make check", including:

1012: PMD - monitor threads   FAILED (pmd.at:657)
1018: ofproto-dpif - active-backup bonding (with primary) FAILED (
ofproto-dpif.at:70)
1020: ofproto-dpif - active-backup bonding (without primary) FAILED (
ofproto-dpif.at:288)
1184: ofproto-dpif megaflow - normal, balance-tcp bonding FAILED (
ofproto-dpif.at:8664)

My earlier analysis of the code showed that a call tree starting from
handle_upcalls can also result in
dpif_netdev_execute getting called - I wonder if that's what's happening
here? I'll keep analysing these
new failures, but any pointers would be appreciated.

Tony

On Tue, Jun 15, 2021 at 12:28 PM Tony van der Peet <
tony.vanderp...@gmail.com> wrote:

> Yes, I can create that patch.
>
> Tony
>
> On Sat, Jun 12, 2021 at 5:24 AM Ilya Maximets  wrote:
>
>> On 6/9/21 3:12 PM, Aaron Conole wrote:
>> > Ilya Maximets  writes:
>> >
>> >> On 6/7/21 3:59 PM, Ilya Maximets wrote:
>> >>> On 6/7/21 3:09 PM, Aaron Conole wrote:
>>  Ilya Maximets  writes:
>> 
>> >> Here is a patch with both a test and a fix.
>> 
>>  Thanks so much!  It's nice to get fixes, but I think it's really
>> great
>>  when test cases come along with them.
>> 
>> > Hi.  Thanks for working n this!
>> >
>> > CC: ovs-dev
>> >
>> >> Not submitting as a formal
>> >> patch because I would like some feedback on whether 1) maintainers
>> feel
>> >> this is worth fixing and
>> >
>> > I can reproduce the crash with your test.  Basically, actions in
>> userspace
>> > datapath may drop packets if something goes wrong.  'meter' action
>> just
>> > seems to be the most explicit variant.  So, I think, this is
>> definitely
>> > worth fixing as some other condition might trigger this crash on
>> packet-out
>> > as well.
>> >
>> > ==2568112==ERROR: AddressSanitizer: heap-use-after-free
>> > on address 0x6160699c at pc 0x00573860 bp 0x7ffebc6cc880 sp
>> 0x7ffebc6cc878
>> > READ of size 1 at 0x6160699c thread T0
>> > #0 0x57385f in dp_packet_delete lib/dp-packet.h:242:16
>> > #1 0x57372c in ofproto_packet_out_uninit
>> ofproto/ofproto.c:3562:5
>> > #2 0x585e77 in handle_packet_out ofproto/ofproto.c:3722:5
>> > #3 0x583801 in handle_single_part_openflow
>> ofproto/ofproto.c:8499:16
>> > #4 0x570c9c in handle_openflow ofproto/ofproto.c:8686:21
>> > #5 0x611781 in ofconn_run ofproto/connmgr.c:1329:13
>> > #6 0x6112ed in connmgr_run ofproto/connmgr.c:356:9
>> > #7 0x56fdf4 in ofproto_run ofproto/ofproto.c:1891:5
>> > #8 0x545ec0 in bridge_run__ vswitchd/bridge.c:3251:9
>> > #9 0x5456a5 in bridge_run vswitchd/bridge.c:3310:5
>> > #10 0x55f5b1 in main vswitchd/ovs-vswitchd.c:127:9
>> > #11 0x7f85bfe09081 in __libc_start_main
>> (/lib64/libc.so.6+0x27081)
>> > #12 0x46d00d in _start (vswitchd/ovs-vswitchd+0x46d00d)
>> >
>> >> 2) whether this is the way to fix it.
>> >
>> > See inline.
>> >
>> >>
>> >> I have tried to make the most minimal change possible, but this
>> means that
>> >> there might be paths through the code that give unexpected
>> behaviour (which
>> >> in the worst case would be a memory leak I suppose).
>> >>
>> >> Tony
>> >>
>> >> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
>> >> index 246be14d0..5e0dabe67 100644
>> >> --- a/lib/dp-packet.h
>> >> +++ b/lib/dp-packet.h
>> >> @@ -739,6 +739,7 @@ struct dp_packet_batch {
>> >> size_t count;
>> >> bool trunc; /* true if the batch needs truncate. */
>> >> bool do_not_steal; /* Indicate that the packets should not be
>> stolen.
>> >> */
>> >> +bool packet_out; /* Indicate single packet is PACKET_OUT */
>> >> struct dp_packet *packets[NETDEV_MAX_BURST];
>> >> };
>> >>
>> >> @@ -748,6 +749,7 @@ dp_packet_batch_init(struct dp_packet_batch
>> *batch)
>> >> batch->count = 0;
>> >> batch->trunc = false;
>> >> batch->do_not_steal = false;
>> >> +

Re: [ovs-discuss] Patch for PACKET_OUT getting deleted twice crash

2021-06-14 Thread Tony van der Peet
Yes, I can create that patch.

Tony

On Sat, Jun 12, 2021 at 5:24 AM Ilya Maximets  wrote:

> On 6/9/21 3:12 PM, Aaron Conole wrote:
> > Ilya Maximets  writes:
> >
> >> On 6/7/21 3:59 PM, Ilya Maximets wrote:
> >>> On 6/7/21 3:09 PM, Aaron Conole wrote:
>  Ilya Maximets  writes:
> 
> >> Here is a patch with both a test and a fix.
> 
>  Thanks so much!  It's nice to get fixes, but I think it's really great
>  when test cases come along with them.
> 
> > Hi.  Thanks for working n this!
> >
> > CC: ovs-dev
> >
> >> Not submitting as a formal
> >> patch because I would like some feedback on whether 1) maintainers
> feel
> >> this is worth fixing and
> >
> > I can reproduce the crash with your test.  Basically, actions in
> userspace
> > datapath may drop packets if something goes wrong.  'meter' action
> just
> > seems to be the most explicit variant.  So, I think, this is
> definitely
> > worth fixing as some other condition might trigger this crash on
> packet-out
> > as well.
> >
> > ==2568112==ERROR: AddressSanitizer: heap-use-after-free
> > on address 0x6160699c at pc 0x00573860 bp 0x7ffebc6cc880 sp
> 0x7ffebc6cc878
> > READ of size 1 at 0x6160699c thread T0
> > #0 0x57385f in dp_packet_delete lib/dp-packet.h:242:16
> > #1 0x57372c in ofproto_packet_out_uninit ofproto/ofproto.c:3562:5
> > #2 0x585e77 in handle_packet_out ofproto/ofproto.c:3722:5
> > #3 0x583801 in handle_single_part_openflow
> ofproto/ofproto.c:8499:16
> > #4 0x570c9c in handle_openflow ofproto/ofproto.c:8686:21
> > #5 0x611781 in ofconn_run ofproto/connmgr.c:1329:13
> > #6 0x6112ed in connmgr_run ofproto/connmgr.c:356:9
> > #7 0x56fdf4 in ofproto_run ofproto/ofproto.c:1891:5
> > #8 0x545ec0 in bridge_run__ vswitchd/bridge.c:3251:9
> > #9 0x5456a5 in bridge_run vswitchd/bridge.c:3310:5
> > #10 0x55f5b1 in main vswitchd/ovs-vswitchd.c:127:9
> > #11 0x7f85bfe09081 in __libc_start_main
> (/lib64/libc.so.6+0x27081)
> > #12 0x46d00d in _start (vswitchd/ovs-vswitchd+0x46d00d)
> >
> >> 2) whether this is the way to fix it.
> >
> > See inline.
> >
> >>
> >> I have tried to make the most minimal change possible, but this
> means that
> >> there might be paths through the code that give unexpected
> behaviour (which
> >> in the worst case would be a memory leak I suppose).
> >>
> >> Tony
> >>
> >> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> >> index 246be14d0..5e0dabe67 100644
> >> --- a/lib/dp-packet.h
> >> +++ b/lib/dp-packet.h
> >> @@ -739,6 +739,7 @@ struct dp_packet_batch {
> >> size_t count;
> >> bool trunc; /* true if the batch needs truncate. */
> >> bool do_not_steal; /* Indicate that the packets should not be
> stolen.
> >> */
> >> +bool packet_out; /* Indicate single packet is PACKET_OUT */
> >> struct dp_packet *packets[NETDEV_MAX_BURST];
> >> };
> >>
> >> @@ -748,6 +749,7 @@ dp_packet_batch_init(struct dp_packet_batch
> *batch)
> >> batch->count = 0;
> >> batch->trunc = false;
> >> batch->do_not_steal = false;
> >> +batch->packet_out = false;
> >> }
> >>
> >> static inline void
> >> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> >> index 650e67ab3..deba4a94a 100644
> >> --- a/lib/dpif-netdev.c
> >> +++ b/lib/dpif-netdev.c
> >> @@ -4170,6 +4170,7 @@ dpif_netdev_execute(struct dpif *dpif, struct
> >> dpif_execute *execute)
> >>
> >> dp_packet_batch_init_packet(, execute->packet);
> >> pp.do_not_steal = true;
> >> +pp.packet_out = execute->packet_out;
> >> dp_netdev_execute_actions(pmd, , false, execute->flow,
> >>   execute->actions,
> execute->actions_len);
> >
> > There is already a dirty hack named "do_not_steal" that was
> introduced,
> > I guess, exactly to avoid crash in the conntrack code that could
> drop/steal
> > the packet just like meter action.  And it seems that here in
> > dpif_netdev_execute() is the only problematic entry point as all
> other
> > normal paths expects that packet might be destroyed.
> >
> > The problem was, I suppose, introduced when we tried to unify
> semantics
> > of "may_steal" flag by turning it into "should_steal".  But it seems
> that
> > in this function we really need to prohibit stealing of the packet
> since
> > ofproto layer still owns it regardless of the result of execution.
> >
> > I don't think that we need one more flag here, but we have several
> options
> > how to fix the crash:
> >
> > 1. Start honoring batch->do_not_steal flag in all actions that may
> result
> >in packet drops.  As the original idea of having 'do_not_steal'
> flag for
> 

Re: [ovs-discuss] Patch for PACKET_OUT getting deleted twice crash

2021-06-11 Thread Ilya Maximets
On 6/9/21 3:12 PM, Aaron Conole wrote:
> Ilya Maximets  writes:
> 
>> On 6/7/21 3:59 PM, Ilya Maximets wrote:
>>> On 6/7/21 3:09 PM, Aaron Conole wrote:
 Ilya Maximets  writes:

>> Here is a patch with both a test and a fix.

 Thanks so much!  It's nice to get fixes, but I think it's really great
 when test cases come along with them.

> Hi.  Thanks for working n this!
>
> CC: ovs-dev
>
>> Not submitting as a formal
>> patch because I would like some feedback on whether 1) maintainers feel
>> this is worth fixing and
>
> I can reproduce the crash with your test.  Basically, actions in userspace
> datapath may drop packets if something goes wrong.  'meter' action just
> seems to be the most explicit variant.  So, I think, this is definitely
> worth fixing as some other condition might trigger this crash on 
> packet-out
> as well.
>
> ==2568112==ERROR: AddressSanitizer: heap-use-after-free
> on address 0x6160699c at pc 0x00573860 bp 0x7ffebc6cc880 sp 
> 0x7ffebc6cc878
> READ of size 1 at 0x6160699c thread T0
> #0 0x57385f in dp_packet_delete lib/dp-packet.h:242:16
> #1 0x57372c in ofproto_packet_out_uninit ofproto/ofproto.c:3562:5
> #2 0x585e77 in handle_packet_out ofproto/ofproto.c:3722:5
> #3 0x583801 in handle_single_part_openflow ofproto/ofproto.c:8499:16
> #4 0x570c9c in handle_openflow ofproto/ofproto.c:8686:21
> #5 0x611781 in ofconn_run ofproto/connmgr.c:1329:13
> #6 0x6112ed in connmgr_run ofproto/connmgr.c:356:9
> #7 0x56fdf4 in ofproto_run ofproto/ofproto.c:1891:5
> #8 0x545ec0 in bridge_run__ vswitchd/bridge.c:3251:9
> #9 0x5456a5 in bridge_run vswitchd/bridge.c:3310:5
> #10 0x55f5b1 in main vswitchd/ovs-vswitchd.c:127:9
> #11 0x7f85bfe09081 in __libc_start_main (/lib64/libc.so.6+0x27081)
> #12 0x46d00d in _start (vswitchd/ovs-vswitchd+0x46d00d)
>
>> 2) whether this is the way to fix it.
>
> See inline.
>
>>
>> I have tried to make the most minimal change possible, but this means 
>> that
>> there might be paths through the code that give unexpected behaviour 
>> (which
>> in the worst case would be a memory leak I suppose).
>>
>> Tony
>>
>> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
>> index 246be14d0..5e0dabe67 100644
>> --- a/lib/dp-packet.h
>> +++ b/lib/dp-packet.h
>> @@ -739,6 +739,7 @@ struct dp_packet_batch {
>> size_t count;
>> bool trunc; /* true if the batch needs truncate. */
>> bool do_not_steal; /* Indicate that the packets should not be stolen.
>> */
>> +bool packet_out; /* Indicate single packet is PACKET_OUT */
>> struct dp_packet *packets[NETDEV_MAX_BURST];
>> };
>>
>> @@ -748,6 +749,7 @@ dp_packet_batch_init(struct dp_packet_batch *batch)
>> batch->count = 0;
>> batch->trunc = false;
>> batch->do_not_steal = false;
>> +batch->packet_out = false;
>> }
>>
>> static inline void
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index 650e67ab3..deba4a94a 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -4170,6 +4170,7 @@ dpif_netdev_execute(struct dpif *dpif, struct
>> dpif_execute *execute)
>>
>> dp_packet_batch_init_packet(, execute->packet);
>> pp.do_not_steal = true;
>> +pp.packet_out = execute->packet_out;
>> dp_netdev_execute_actions(pmd, , false, execute->flow,
>>   execute->actions, execute->actions_len);
>
> There is already a dirty hack named "do_not_steal" that was introduced,
> I guess, exactly to avoid crash in the conntrack code that could 
> drop/steal
> the packet just like meter action.  And it seems that here in
> dpif_netdev_execute() is the only problematic entry point as all other
> normal paths expects that packet might be destroyed.
>
> The problem was, I suppose, introduced when we tried to unify semantics
> of "may_steal" flag by turning it into "should_steal".  But it seems that
> in this function we really need to prohibit stealing of the packet since
> ofproto layer still owns it regardless of the result of execution.
>
> I don't think that we need one more flag here, but we have several options
> how to fix the crash:
>
> 1. Start honoring batch->do_not_steal flag in all actions that may result
>in packet drops.  As the original idea of having 'do_not_steal' flag 
> for
>a batch is very hacky, I'd like to not do that.

 +1

> 2. Try to propagate information that packet was deleted up to ofproto 
> layer,
>i.e. make handle_packet_out() aware of that.  Will, probably, be not 
> that
>easy to do.

 I had a look at 

Re: [ovs-discuss] Patch for PACKET_OUT getting deleted twice crash

2021-06-09 Thread Aaron Conole
Ilya Maximets  writes:

> On 6/7/21 3:59 PM, Ilya Maximets wrote:
>> On 6/7/21 3:09 PM, Aaron Conole wrote:
>>> Ilya Maximets  writes:
>>>
> Here is a patch with both a test and a fix.
>>>
>>> Thanks so much!  It's nice to get fixes, but I think it's really great
>>> when test cases come along with them.
>>>
 Hi.  Thanks for working n this!

 CC: ovs-dev

> Not submitting as a formal
> patch because I would like some feedback on whether 1) maintainers feel
> this is worth fixing and

 I can reproduce the crash with your test.  Basically, actions in userspace
 datapath may drop packets if something goes wrong.  'meter' action just
 seems to be the most explicit variant.  So, I think, this is definitely
 worth fixing as some other condition might trigger this crash on packet-out
 as well.

 ==2568112==ERROR: AddressSanitizer: heap-use-after-free
 on address 0x6160699c at pc 0x00573860 bp 0x7ffebc6cc880 sp 
 0x7ffebc6cc878
 READ of size 1 at 0x6160699c thread T0
 #0 0x57385f in dp_packet_delete lib/dp-packet.h:242:16
 #1 0x57372c in ofproto_packet_out_uninit ofproto/ofproto.c:3562:5
 #2 0x585e77 in handle_packet_out ofproto/ofproto.c:3722:5
 #3 0x583801 in handle_single_part_openflow ofproto/ofproto.c:8499:16
 #4 0x570c9c in handle_openflow ofproto/ofproto.c:8686:21
 #5 0x611781 in ofconn_run ofproto/connmgr.c:1329:13
 #6 0x6112ed in connmgr_run ofproto/connmgr.c:356:9
 #7 0x56fdf4 in ofproto_run ofproto/ofproto.c:1891:5
 #8 0x545ec0 in bridge_run__ vswitchd/bridge.c:3251:9
 #9 0x5456a5 in bridge_run vswitchd/bridge.c:3310:5
 #10 0x55f5b1 in main vswitchd/ovs-vswitchd.c:127:9
 #11 0x7f85bfe09081 in __libc_start_main (/lib64/libc.so.6+0x27081)
 #12 0x46d00d in _start (vswitchd/ovs-vswitchd+0x46d00d)

> 2) whether this is the way to fix it.

 See inline.

>
> I have tried to make the most minimal change possible, but this means that
> there might be paths through the code that give unexpected behaviour 
> (which
> in the worst case would be a memory leak I suppose).
>
> Tony
>
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index 246be14d0..5e0dabe67 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -739,6 +739,7 @@ struct dp_packet_batch {
> size_t count;
> bool trunc; /* true if the batch needs truncate. */
> bool do_not_steal; /* Indicate that the packets should not be stolen.
> */
> +bool packet_out; /* Indicate single packet is PACKET_OUT */
> struct dp_packet *packets[NETDEV_MAX_BURST];
> };
>
> @@ -748,6 +749,7 @@ dp_packet_batch_init(struct dp_packet_batch *batch)
> batch->count = 0;
> batch->trunc = false;
> batch->do_not_steal = false;
> +batch->packet_out = false;
> }
>
> static inline void
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 650e67ab3..deba4a94a 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -4170,6 +4170,7 @@ dpif_netdev_execute(struct dpif *dpif, struct
> dpif_execute *execute)
>
> dp_packet_batch_init_packet(, execute->packet);
> pp.do_not_steal = true;
> +pp.packet_out = execute->packet_out;
> dp_netdev_execute_actions(pmd, , false, execute->flow,
>   execute->actions, execute->actions_len);

 There is already a dirty hack named "do_not_steal" that was introduced,
 I guess, exactly to avoid crash in the conntrack code that could drop/steal
 the packet just like meter action.  And it seems that here in
 dpif_netdev_execute() is the only problematic entry point as all other
 normal paths expects that packet might be destroyed.

 The problem was, I suppose, introduced when we tried to unify semantics
 of "may_steal" flag by turning it into "should_steal".  But it seems that
 in this function we really need to prohibit stealing of the packet since
 ofproto layer still owns it regardless of the result of execution.

 I don't think that we need one more flag here, but we have several options
 how to fix the crash:

 1. Start honoring batch->do_not_steal flag in all actions that may result
in packet drops.  As the original idea of having 'do_not_steal' flag for
a batch is very hacky, I'd like to not do that.
>>>
>>> +1
>>>
 2. Try to propagate information that packet was deleted up to ofproto 
 layer,
i.e. make handle_packet_out() aware of that.  Will, probably, be not 
 that
easy to do.
>>>
>>> I had a look at doing this, but as you note it is quite intrusive, and
>>> we need to make changes all over.
>>>
 3. This function (dpif_netdev_execute) is not on a hot path in userspace

Re: [ovs-discuss] Patch for PACKET_OUT getting deleted twice crash

2021-06-07 Thread Ilya Maximets
On 6/7/21 3:59 PM, Ilya Maximets wrote:
> On 6/7/21 3:09 PM, Aaron Conole wrote:
>> Ilya Maximets  writes:
>>
 Here is a patch with both a test and a fix.
>>
>> Thanks so much!  It's nice to get fixes, but I think it's really great
>> when test cases come along with them.
>>
>>> Hi.  Thanks for working n this!
>>>
>>> CC: ovs-dev
>>>
 Not submitting as a formal
 patch because I would like some feedback on whether 1) maintainers feel
 this is worth fixing and
>>>
>>> I can reproduce the crash with your test.  Basically, actions in userspace
>>> datapath may drop packets if something goes wrong.  'meter' action just
>>> seems to be the most explicit variant.  So, I think, this is definitely
>>> worth fixing as some other condition might trigger this crash on packet-out
>>> as well.
>>>
>>> ==2568112==ERROR: AddressSanitizer: heap-use-after-free
>>> on address 0x6160699c at pc 0x00573860 bp 0x7ffebc6cc880 sp 
>>> 0x7ffebc6cc878
>>> READ of size 1 at 0x6160699c thread T0
>>> #0 0x57385f in dp_packet_delete lib/dp-packet.h:242:16
>>> #1 0x57372c in ofproto_packet_out_uninit ofproto/ofproto.c:3562:5
>>> #2 0x585e77 in handle_packet_out ofproto/ofproto.c:3722:5
>>> #3 0x583801 in handle_single_part_openflow ofproto/ofproto.c:8499:16
>>> #4 0x570c9c in handle_openflow ofproto/ofproto.c:8686:21
>>> #5 0x611781 in ofconn_run ofproto/connmgr.c:1329:13
>>> #6 0x6112ed in connmgr_run ofproto/connmgr.c:356:9
>>> #7 0x56fdf4 in ofproto_run ofproto/ofproto.c:1891:5
>>> #8 0x545ec0 in bridge_run__ vswitchd/bridge.c:3251:9
>>> #9 0x5456a5 in bridge_run vswitchd/bridge.c:3310:5
>>> #10 0x55f5b1 in main vswitchd/ovs-vswitchd.c:127:9
>>> #11 0x7f85bfe09081 in __libc_start_main (/lib64/libc.so.6+0x27081)
>>> #12 0x46d00d in _start (vswitchd/ovs-vswitchd+0x46d00d)
>>>
 2) whether this is the way to fix it.
>>>
>>> See inline.
>>>

 I have tried to make the most minimal change possible, but this means that
 there might be paths through the code that give unexpected behaviour (which
 in the worst case would be a memory leak I suppose).

 Tony

 diff --git a/lib/dp-packet.h b/lib/dp-packet.h
 index 246be14d0..5e0dabe67 100644
 --- a/lib/dp-packet.h
 +++ b/lib/dp-packet.h
 @@ -739,6 +739,7 @@ struct dp_packet_batch {
 size_t count;
 bool trunc; /* true if the batch needs truncate. */
 bool do_not_steal; /* Indicate that the packets should not be stolen.
 */
 +bool packet_out; /* Indicate single packet is PACKET_OUT */
 struct dp_packet *packets[NETDEV_MAX_BURST];
 };

 @@ -748,6 +749,7 @@ dp_packet_batch_init(struct dp_packet_batch *batch)
 batch->count = 0;
 batch->trunc = false;
 batch->do_not_steal = false;
 +batch->packet_out = false;
 }

 static inline void
 diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
 index 650e67ab3..deba4a94a 100644
 --- a/lib/dpif-netdev.c
 +++ b/lib/dpif-netdev.c
 @@ -4170,6 +4170,7 @@ dpif_netdev_execute(struct dpif *dpif, struct
 dpif_execute *execute)

 dp_packet_batch_init_packet(, execute->packet);
 pp.do_not_steal = true;
 +pp.packet_out = execute->packet_out;
 dp_netdev_execute_actions(pmd, , false, execute->flow,
   execute->actions, execute->actions_len);
>>>
>>> There is already a dirty hack named "do_not_steal" that was introduced,
>>> I guess, exactly to avoid crash in the conntrack code that could drop/steal
>>> the packet just like meter action.  And it seems that here in
>>> dpif_netdev_execute() is the only problematic entry point as all other
>>> normal paths expects that packet might be destroyed.
>>>
>>> The problem was, I suppose, introduced when we tried to unify semantics
>>> of "may_steal" flag by turning it into "should_steal".  But it seems that
>>> in this function we really need to prohibit stealing of the packet since
>>> ofproto layer still owns it regardless of the result of execution.
>>>
>>> I don't think that we need one more flag here, but we have several options
>>> how to fix the crash:
>>>
>>> 1. Start honoring batch->do_not_steal flag in all actions that may result
>>>in packet drops.  As the original idea of having 'do_not_steal' flag for
>>>a batch is very hacky, I'd like to not do that.
>>
>> +1
>>
>>> 2. Try to propagate information that packet was deleted up to ofproto layer,
>>>i.e. make handle_packet_out() aware of that.  Will, probably, be not that
>>>easy to do.
>>
>> I had a look at doing this, but as you note it is quite intrusive, and
>> we need to make changes all over.
>>
>>> 3. This function (dpif_netdev_execute) is not on a hot path in userspace
>>>datapath, IIUC.  It might be that it's just easier to remove the
>>>'do_not_steal' flag entirely, clone the packet here and call the
>>>

Re: [ovs-discuss] Patch for PACKET_OUT getting deleted twice crash

2021-06-07 Thread Ilya Maximets
On 6/7/21 3:09 PM, Aaron Conole wrote:
> Ilya Maximets  writes:
> 
>>> Here is a patch with both a test and a fix.
> 
> Thanks so much!  It's nice to get fixes, but I think it's really great
> when test cases come along with them.
> 
>> Hi.  Thanks for working n this!
>>
>> CC: ovs-dev
>>
>>> Not submitting as a formal
>>> patch because I would like some feedback on whether 1) maintainers feel
>>> this is worth fixing and
>>
>> I can reproduce the crash with your test.  Basically, actions in userspace
>> datapath may drop packets if something goes wrong.  'meter' action just
>> seems to be the most explicit variant.  So, I think, this is definitely
>> worth fixing as some other condition might trigger this crash on packet-out
>> as well.
>>
>> ==2568112==ERROR: AddressSanitizer: heap-use-after-free
>> on address 0x6160699c at pc 0x00573860 bp 0x7ffebc6cc880 sp 
>> 0x7ffebc6cc878
>> READ of size 1 at 0x6160699c thread T0
>> #0 0x57385f in dp_packet_delete lib/dp-packet.h:242:16
>> #1 0x57372c in ofproto_packet_out_uninit ofproto/ofproto.c:3562:5
>> #2 0x585e77 in handle_packet_out ofproto/ofproto.c:3722:5
>> #3 0x583801 in handle_single_part_openflow ofproto/ofproto.c:8499:16
>> #4 0x570c9c in handle_openflow ofproto/ofproto.c:8686:21
>> #5 0x611781 in ofconn_run ofproto/connmgr.c:1329:13
>> #6 0x6112ed in connmgr_run ofproto/connmgr.c:356:9
>> #7 0x56fdf4 in ofproto_run ofproto/ofproto.c:1891:5
>> #8 0x545ec0 in bridge_run__ vswitchd/bridge.c:3251:9
>> #9 0x5456a5 in bridge_run vswitchd/bridge.c:3310:5
>> #10 0x55f5b1 in main vswitchd/ovs-vswitchd.c:127:9
>> #11 0x7f85bfe09081 in __libc_start_main (/lib64/libc.so.6+0x27081)
>> #12 0x46d00d in _start (vswitchd/ovs-vswitchd+0x46d00d)
>>
>>> 2) whether this is the way to fix it.
>>
>> See inline.
>>
>>>
>>> I have tried to make the most minimal change possible, but this means that
>>> there might be paths through the code that give unexpected behaviour (which
>>> in the worst case would be a memory leak I suppose).
>>>
>>> Tony
>>>
>>> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
>>> index 246be14d0..5e0dabe67 100644
>>> --- a/lib/dp-packet.h
>>> +++ b/lib/dp-packet.h
>>> @@ -739,6 +739,7 @@ struct dp_packet_batch {
>>> size_t count;
>>> bool trunc; /* true if the batch needs truncate. */
>>> bool do_not_steal; /* Indicate that the packets should not be stolen.
>>> */
>>> +bool packet_out; /* Indicate single packet is PACKET_OUT */
>>> struct dp_packet *packets[NETDEV_MAX_BURST];
>>> };
>>>
>>> @@ -748,6 +749,7 @@ dp_packet_batch_init(struct dp_packet_batch *batch)
>>> batch->count = 0;
>>> batch->trunc = false;
>>> batch->do_not_steal = false;
>>> +batch->packet_out = false;
>>> }
>>>
>>> static inline void
>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>> index 650e67ab3..deba4a94a 100644
>>> --- a/lib/dpif-netdev.c
>>> +++ b/lib/dpif-netdev.c
>>> @@ -4170,6 +4170,7 @@ dpif_netdev_execute(struct dpif *dpif, struct
>>> dpif_execute *execute)
>>>
>>> dp_packet_batch_init_packet(, execute->packet);
>>> pp.do_not_steal = true;
>>> +pp.packet_out = execute->packet_out;
>>> dp_netdev_execute_actions(pmd, , false, execute->flow,
>>>   execute->actions, execute->actions_len);
>>
>> There is already a dirty hack named "do_not_steal" that was introduced,
>> I guess, exactly to avoid crash in the conntrack code that could drop/steal
>> the packet just like meter action.  And it seems that here in
>> dpif_netdev_execute() is the only problematic entry point as all other
>> normal paths expects that packet might be destroyed.
>>
>> The problem was, I suppose, introduced when we tried to unify semantics
>> of "may_steal" flag by turning it into "should_steal".  But it seems that
>> in this function we really need to prohibit stealing of the packet since
>> ofproto layer still owns it regardless of the result of execution.
>>
>> I don't think that we need one more flag here, but we have several options
>> how to fix the crash:
>>
>> 1. Start honoring batch->do_not_steal flag in all actions that may result
>>in packet drops.  As the original idea of having 'do_not_steal' flag for
>>a batch is very hacky, I'd like to not do that.
> 
> +1
> 
>> 2. Try to propagate information that packet was deleted up to ofproto layer,
>>i.e. make handle_packet_out() aware of that.  Will, probably, be not that
>>easy to do.
> 
> I had a look at doing this, but as you note it is quite intrusive, and
> we need to make changes all over.
> 
>> 3. This function (dpif_netdev_execute) is not on a hot path in userspace
>>datapath, IIUC.  It might be that it's just easier to remove the
>>'do_not_steal' flag entirely, clone the packet here and call the
>>dp_netdev_execute_actions() with should_steal=true.
>>This sounds like the best solution for me, unless I overlooked some
>>scenario, where this code is on a 

Re: [ovs-discuss] Patch for PACKET_OUT getting deleted twice crash

2021-06-07 Thread Aaron Conole
Ilya Maximets  writes:

>> Here is a patch with both a test and a fix.

Thanks so much!  It's nice to get fixes, but I think it's really great
when test cases come along with them.

> Hi.  Thanks for working n this!
>
> CC: ovs-dev
>
>> Not submitting as a formal
>> patch because I would like some feedback on whether 1) maintainers feel
>> this is worth fixing and
>
> I can reproduce the crash with your test.  Basically, actions in userspace
> datapath may drop packets if something goes wrong.  'meter' action just
> seems to be the most explicit variant.  So, I think, this is definitely
> worth fixing as some other condition might trigger this crash on packet-out
> as well.
>
> ==2568112==ERROR: AddressSanitizer: heap-use-after-free
> on address 0x6160699c at pc 0x00573860 bp 0x7ffebc6cc880 sp 
> 0x7ffebc6cc878
> READ of size 1 at 0x6160699c thread T0
> #0 0x57385f in dp_packet_delete lib/dp-packet.h:242:16
> #1 0x57372c in ofproto_packet_out_uninit ofproto/ofproto.c:3562:5
> #2 0x585e77 in handle_packet_out ofproto/ofproto.c:3722:5
> #3 0x583801 in handle_single_part_openflow ofproto/ofproto.c:8499:16
> #4 0x570c9c in handle_openflow ofproto/ofproto.c:8686:21
> #5 0x611781 in ofconn_run ofproto/connmgr.c:1329:13
> #6 0x6112ed in connmgr_run ofproto/connmgr.c:356:9
> #7 0x56fdf4 in ofproto_run ofproto/ofproto.c:1891:5
> #8 0x545ec0 in bridge_run__ vswitchd/bridge.c:3251:9
> #9 0x5456a5 in bridge_run vswitchd/bridge.c:3310:5
> #10 0x55f5b1 in main vswitchd/ovs-vswitchd.c:127:9
> #11 0x7f85bfe09081 in __libc_start_main (/lib64/libc.so.6+0x27081)
> #12 0x46d00d in _start (vswitchd/ovs-vswitchd+0x46d00d)
>
>> 2) whether this is the way to fix it.
>
> See inline.
>
>> 
>> I have tried to make the most minimal change possible, but this means that
>> there might be paths through the code that give unexpected behaviour (which
>> in the worst case would be a memory leak I suppose).
>> 
>> Tony
>> 
>> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
>> index 246be14d0..5e0dabe67 100644
>> --- a/lib/dp-packet.h
>> +++ b/lib/dp-packet.h
>> @@ -739,6 +739,7 @@ struct dp_packet_batch {
>> size_t count;
>> bool trunc; /* true if the batch needs truncate. */
>> bool do_not_steal; /* Indicate that the packets should not be stolen.
>> */
>> +bool packet_out; /* Indicate single packet is PACKET_OUT */
>> struct dp_packet *packets[NETDEV_MAX_BURST];
>> };
>> 
>> @@ -748,6 +749,7 @@ dp_packet_batch_init(struct dp_packet_batch *batch)
>> batch->count = 0;
>> batch->trunc = false;
>> batch->do_not_steal = false;
>> +batch->packet_out = false;
>> }
>> 
>> static inline void
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index 650e67ab3..deba4a94a 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -4170,6 +4170,7 @@ dpif_netdev_execute(struct dpif *dpif, struct
>> dpif_execute *execute)
>> 
>> dp_packet_batch_init_packet(, execute->packet);
>> pp.do_not_steal = true;
>> +pp.packet_out = execute->packet_out;
>> dp_netdev_execute_actions(pmd, , false, execute->flow,
>>   execute->actions, execute->actions_len);
>
> There is already a dirty hack named "do_not_steal" that was introduced,
> I guess, exactly to avoid crash in the conntrack code that could drop/steal
> the packet just like meter action.  And it seems that here in
> dpif_netdev_execute() is the only problematic entry point as all other
> normal paths expects that packet might be destroyed.
>
> The problem was, I suppose, introduced when we tried to unify semantics
> of "may_steal" flag by turning it into "should_steal".  But it seems that
> in this function we really need to prohibit stealing of the packet since
> ofproto layer still owns it regardless of the result of execution.
>
> I don't think that we need one more flag here, but we have several options
> how to fix the crash:
>
> 1. Start honoring batch->do_not_steal flag in all actions that may result
>in packet drops.  As the original idea of having 'do_not_steal' flag for
>a batch is very hacky, I'd like to not do that.

+1

> 2. Try to propagate information that packet was deleted up to ofproto layer,
>i.e. make handle_packet_out() aware of that.  Will, probably, be not that
>easy to do.

I had a look at doing this, but as you note it is quite intrusive, and
we need to make changes all over.

> 3. This function (dpif_netdev_execute) is not on a hot path in userspace
>datapath, IIUC.  It might be that it's just easier to remove the
>'do_not_steal' flag entirely, clone the packet here and call the
>dp_netdev_execute_actions() with should_steal=true.
>This sounds like the best solution for me, unless I overlooked some
>scenario, where this code is on a hot path.

I like this approach.

> Thoughts?
>
> Aaron, you have a patch[1] to remove 'do_not_steal' flag, so fix for this 
> issue
> will, likely, touch the same 

Re: [ovs-discuss] Patch for PACKET_OUT getting deleted twice crash

2021-06-04 Thread Ilya Maximets
> Here is a patch with both a test and a fix.

Hi.  Thanks for working n this!

CC: ovs-dev

> Not submitting as a formal
> patch because I would like some feedback on whether 1) maintainers feel
> this is worth fixing and

I can reproduce the crash with your test.  Basically, actions in userspace
datapath may drop packets if something goes wrong.  'meter' action just
seems to be the most explicit variant.  So, I think, this is definitely
worth fixing as some other condition might trigger this crash on packet-out
as well.

==2568112==ERROR: AddressSanitizer: heap-use-after-free
on address 0x6160699c at pc 0x00573860 bp 0x7ffebc6cc880 sp 
0x7ffebc6cc878
READ of size 1 at 0x6160699c thread T0
#0 0x57385f in dp_packet_delete lib/dp-packet.h:242:16
#1 0x57372c in ofproto_packet_out_uninit ofproto/ofproto.c:3562:5
#2 0x585e77 in handle_packet_out ofproto/ofproto.c:3722:5
#3 0x583801 in handle_single_part_openflow ofproto/ofproto.c:8499:16
#4 0x570c9c in handle_openflow ofproto/ofproto.c:8686:21
#5 0x611781 in ofconn_run ofproto/connmgr.c:1329:13
#6 0x6112ed in connmgr_run ofproto/connmgr.c:356:9
#7 0x56fdf4 in ofproto_run ofproto/ofproto.c:1891:5
#8 0x545ec0 in bridge_run__ vswitchd/bridge.c:3251:9
#9 0x5456a5 in bridge_run vswitchd/bridge.c:3310:5
#10 0x55f5b1 in main vswitchd/ovs-vswitchd.c:127:9
#11 0x7f85bfe09081 in __libc_start_main (/lib64/libc.so.6+0x27081)
#12 0x46d00d in _start (vswitchd/ovs-vswitchd+0x46d00d)

> 2) whether this is the way to fix it.

See inline.

> 
> I have tried to make the most minimal change possible, but this means that
> there might be paths through the code that give unexpected behaviour (which
> in the worst case would be a memory leak I suppose).
> 
> Tony
> 
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index 246be14d0..5e0dabe67 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -739,6 +739,7 @@ struct dp_packet_batch {
> size_t count;
> bool trunc; /* true if the batch needs truncate. */
> bool do_not_steal; /* Indicate that the packets should not be stolen.
> */
> +bool packet_out; /* Indicate single packet is PACKET_OUT */
> struct dp_packet *packets[NETDEV_MAX_BURST];
> };
> 
> @@ -748,6 +749,7 @@ dp_packet_batch_init(struct dp_packet_batch *batch)
> batch->count = 0;
> batch->trunc = false;
> batch->do_not_steal = false;
> +batch->packet_out = false;
> }
> 
> static inline void
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 650e67ab3..deba4a94a 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -4170,6 +4170,7 @@ dpif_netdev_execute(struct dpif *dpif, struct
> dpif_execute *execute)
> 
> dp_packet_batch_init_packet(, execute->packet);
> pp.do_not_steal = true;
> +pp.packet_out = execute->packet_out;
> dp_netdev_execute_actions(pmd, , false, execute->flow,
>   execute->actions, execute->actions_len);

There is already a dirty hack named "do_not_steal" that was introduced,
I guess, exactly to avoid crash in the conntrack code that could drop/steal
the packet just like meter action.  And it seems that here in
dpif_netdev_execute() is the only problematic entry point as all other
normal paths expects that packet might be destroyed.

The problem was, I suppose, introduced when we tried to unify semantics
of "may_steal" flag by turning it into "should_steal".  But it seems that
in this function we really need to prohibit stealing of the packet since
ofproto layer still owns it regardless of the result of execution.

I don't think that we need one more flag here, but we have several options
how to fix the crash:

1. Start honoring batch->do_not_steal flag in all actions that may result
   in packet drops.  As the original idea of having 'do_not_steal' flag for
   a batch is very hacky, I'd like to not do that.

2. Try to propagate information that packet was deleted up to ofproto layer,
   i.e. make handle_packet_out() aware of that.  Will, probably, be not that
   easy to do.

3. This function (dpif_netdev_execute) is not on a hot path in userspace
   datapath, IIUC.  It might be that it's just easier to remove the
   'do_not_steal' flag entirely, clone the packet here and call the
   dp_netdev_execute_actions() with should_steal=true.
   This sounds like the best solution for me, unless I overlooked some
   scenario, where this code is on a hot path.

Thoughts?

Aaron, you have a patch[1] to remove 'do_not_steal' flag, so fix for this issue
will, likely, touch the same parts of the code.  What do you think about this
issue and possible solutions?

Best regards, Ilya Maximets.


[1] 
https://patchwork.ozlabs.org/project/openvswitch/patch/20210521175905.165779-1-acon...@redhat.com/

Non-line-wrapped version of the test for convenience:

diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at
index 31064ed95..d01f438b8 100644
--- a/tests/ofproto-dpif.at
+++ b/tests/ofproto-dpif.at
@@ -2159,6