Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

2018-06-20 Thread Ilya Maximets
On 19.06.2018 19:03, Ben Pfaff wrote:
> On Tue, Jun 19, 2018 at 10:58:36AM +0300, Ilya Maximets wrote:
>> On 18.06.2018 22:24, Ben Pfaff wrote:
>>> On Mon, Jun 18, 2018 at 06:34:13PM +0300, Ilya Maximets wrote:
 On 18.06.2018 18:07, Ben Pfaff wrote:
> On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote:
>>> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote:
 On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote:
> On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff  wrote:
>>
>> Until now, rconn_get_version() has only reported the OpenFlow 
>> version in
>> use when the rconn is actually connected.  This makes sense, but it 
>> has a
>> harsh consequence.  Consider code like this:
>>
>> if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) {
>> for (int i = 0; i < 2; i++) {
>> struct ofpbuf *b = ofputil_encode_echo_request(
>> rconn_get_version(rconn));
>> rconn_send(rconn, b, NULL);
>> }
>> }
>>
>> Maybe not the smartest code in the world, and probably no one would 
>> write
>> this exact code in any case, but it doesn't look too risky or crazy.
>>
>> But it is.  The second trip through the loop can assert-fail inside
>> ofputil_encode_echo_request() because rconn_get_version(rconn) 
>> returns -1
>> instead of a valid OpenFlow version.  That happens if the first call 
>> to
>> rconn_send() encounters an error while sending the message and 
>> therefore
>> destroys the underlying vconn and disconnects so that 
>> rconn_get_version()
>> doesn't have a vconn to query for its version.
>>
>> In a case like this where all the code to send the messages is close 
>> by,
> we
>> could just check rconn_get_version() in each loop iteration.  We 
>> could
> even
>> go through the tree and convince ourselves that individual bits of 
>> code
> are
>> safe, or be conservative and check rconn_get_version() >= 0 in the 
>> iffy
>> cases.  But this seems to me like an ongoing source of risk and a 
>> way to
>> get things wrong in corner cases.
>>
>> This commit takes a different approach.  It introduces a new 
>> invariant: if
>> an rconn has ever been connected, then it returns a valid OpenFlow 
>> version
>> from rconn_get_version().  In addition, if an rconn is currently
> connected,
>> then the OpenFlow version it returns is the correct one (that may be
>> obvious, but there were corner cases before where it returned -1 even
>> though rconn_is_connected() returned true).
>>
>> With this commit, the code above would work OK.  If the first call to
>> rconn_send() encounters an error sending the message, then
>> rconn_get_version() in the second iteration will return the same 
>> value as
>> in the first iteration.  The message passed to rconn_send() will end 
>> up
>> being discarded, but that's much better than either an assertion 
>> failure
> or
>> having to carefully analyze a lot of our code to deal with one 
>> unusual
>> corner case.
>>
>> Reported-by: Han Zhou 
>> Signed-off-by: Ben Pfaff 
>> ---
>>  lib/learning-switch.c |  2 +-
>>  lib/rconn.c   | 41 -
>>  lib/vconn.c   |  1 +
>>  3 files changed, 18 insertions(+), 26 deletions(-)
>>
> Acked-by: Han Zhou 

 Thanks.  I applied this to master.  I'll backport it to older versions
 if no one notices trouble soon.
>>>
>>> I backported as far as branch-2.5.
>>
>> Sorry for being late, but I just tried to use branch-2.9 on my test 
>> environment
>> and found that this patch produces a lot of "connected" logs that are a 
>> bit annoying:
>>
>> 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected
>> 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected
>> 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected
>> 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected
>> 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected
>> 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected
>> 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected
>> 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected
>> 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected

Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

2018-06-19 Thread Ben Pfaff
On Tue, Jun 19, 2018 at 10:58:36AM +0300, Ilya Maximets wrote:
> On 18.06.2018 22:24, Ben Pfaff wrote:
> > On Mon, Jun 18, 2018 at 06:34:13PM +0300, Ilya Maximets wrote:
> >> On 18.06.2018 18:07, Ben Pfaff wrote:
> >>> On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote:
> > On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote:
> >> On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote:
> >>> On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff  wrote:
> 
>  Until now, rconn_get_version() has only reported the OpenFlow 
>  version in
>  use when the rconn is actually connected.  This makes sense, but it 
>  has a
>  harsh consequence.  Consider code like this:
> 
>  if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) {
>  for (int i = 0; i < 2; i++) {
>  struct ofpbuf *b = ofputil_encode_echo_request(
>  rconn_get_version(rconn));
>  rconn_send(rconn, b, NULL);
>  }
>  }
> 
>  Maybe not the smartest code in the world, and probably no one would 
>  write
>  this exact code in any case, but it doesn't look too risky or crazy.
> 
>  But it is.  The second trip through the loop can assert-fail inside
>  ofputil_encode_echo_request() because rconn_get_version(rconn) 
>  returns -1
>  instead of a valid OpenFlow version.  That happens if the first call 
>  to
>  rconn_send() encounters an error while sending the message and 
>  therefore
>  destroys the underlying vconn and disconnects so that 
>  rconn_get_version()
>  doesn't have a vconn to query for its version.
> 
>  In a case like this where all the code to send the messages is close 
>  by,
> >>> we
>  could just check rconn_get_version() in each loop iteration.  We 
>  could
> >>> even
>  go through the tree and convince ourselves that individual bits of 
>  code
> >>> are
>  safe, or be conservative and check rconn_get_version() >= 0 in the 
>  iffy
>  cases.  But this seems to me like an ongoing source of risk and a 
>  way to
>  get things wrong in corner cases.
> 
>  This commit takes a different approach.  It introduces a new 
>  invariant: if
>  an rconn has ever been connected, then it returns a valid OpenFlow 
>  version
>  from rconn_get_version().  In addition, if an rconn is currently
> >>> connected,
>  then the OpenFlow version it returns is the correct one (that may be
>  obvious, but there were corner cases before where it returned -1 even
>  though rconn_is_connected() returned true).
> 
>  With this commit, the code above would work OK.  If the first call to
>  rconn_send() encounters an error sending the message, then
>  rconn_get_version() in the second iteration will return the same 
>  value as
>  in the first iteration.  The message passed to rconn_send() will end 
>  up
>  being discarded, but that's much better than either an assertion 
>  failure
> >>> or
>  having to carefully analyze a lot of our code to deal with one 
>  unusual
>  corner case.
> 
>  Reported-by: Han Zhou 
>  Signed-off-by: Ben Pfaff 
>  ---
>   lib/learning-switch.c |  2 +-
>   lib/rconn.c   | 41 -
>   lib/vconn.c   |  1 +
>   3 files changed, 18 insertions(+), 26 deletions(-)
> 
> >>> Acked-by: Han Zhou 
> >>
> >> Thanks.  I applied this to master.  I'll backport it to older versions
> >> if no one notices trouble soon.
> >
> > I backported as far as branch-2.5.
> 
>  Sorry for being late, but I just tried to use branch-2.9 on my test 
>  environment
>  and found that this patch produces a lot of "connected" logs that are a 
>  bit annoying:
> 
>  2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected
>  2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected
>  2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected
>  2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected
>  2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected
>  2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected
>  2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected
>  2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected
>  2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected
>  

Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

2018-06-19 Thread Ilya Maximets
On 18.06.2018 22:24, Ben Pfaff wrote:
> On Mon, Jun 18, 2018 at 06:34:13PM +0300, Ilya Maximets wrote:
>> On 18.06.2018 18:07, Ben Pfaff wrote:
>>> On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote:
> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote:
>> On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote:
>>> On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff  wrote:

 Until now, rconn_get_version() has only reported the OpenFlow version 
 in
 use when the rconn is actually connected.  This makes sense, but it 
 has a
 harsh consequence.  Consider code like this:

 if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) {
 for (int i = 0; i < 2; i++) {
 struct ofpbuf *b = ofputil_encode_echo_request(
 rconn_get_version(rconn));
 rconn_send(rconn, b, NULL);
 }
 }

 Maybe not the smartest code in the world, and probably no one would 
 write
 this exact code in any case, but it doesn't look too risky or crazy.

 But it is.  The second trip through the loop can assert-fail inside
 ofputil_encode_echo_request() because rconn_get_version(rconn) returns 
 -1
 instead of a valid OpenFlow version.  That happens if the first call to
 rconn_send() encounters an error while sending the message and 
 therefore
 destroys the underlying vconn and disconnects so that 
 rconn_get_version()
 doesn't have a vconn to query for its version.

 In a case like this where all the code to send the messages is close 
 by,
>>> we
 could just check rconn_get_version() in each loop iteration.  We could
>>> even
 go through the tree and convince ourselves that individual bits of code
>>> are
 safe, or be conservative and check rconn_get_version() >= 0 in the iffy
 cases.  But this seems to me like an ongoing source of risk and a way 
 to
 get things wrong in corner cases.

 This commit takes a different approach.  It introduces a new 
 invariant: if
 an rconn has ever been connected, then it returns a valid OpenFlow 
 version
 from rconn_get_version().  In addition, if an rconn is currently
>>> connected,
 then the OpenFlow version it returns is the correct one (that may be
 obvious, but there were corner cases before where it returned -1 even
 though rconn_is_connected() returned true).

 With this commit, the code above would work OK.  If the first call to
 rconn_send() encounters an error sending the message, then
 rconn_get_version() in the second iteration will return the same value 
 as
 in the first iteration.  The message passed to rconn_send() will end up
 being discarded, but that's much better than either an assertion 
 failure
>>> or
 having to carefully analyze a lot of our code to deal with one unusual
 corner case.

 Reported-by: Han Zhou 
 Signed-off-by: Ben Pfaff 
 ---
  lib/learning-switch.c |  2 +-
  lib/rconn.c   | 41 -
  lib/vconn.c   |  1 +
  3 files changed, 18 insertions(+), 26 deletions(-)

>>> Acked-by: Han Zhou 
>>
>> Thanks.  I applied this to master.  I'll backport it to older versions
>> if no one notices trouble soon.
>
> I backported as far as branch-2.5.

 Sorry for being late, but I just tried to use branch-2.9 on my test 
 environment
 and found that this patch produces a lot of "connected" logs that are a 
 bit annoying:

 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected
 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected
 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected
 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected
 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected
 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected
 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected
 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected
 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected
 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected
 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected
 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected
 2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected
 2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886: 

Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

2018-06-19 Thread Ilya Maximets
On 18.06.2018 22:25, Ben Pfaff wrote:
> On Mon, Jun 18, 2018 at 09:31:43AM -0700, Han Zhou wrote:
>> On Mon, Jun 18, 2018 at 8:34 AM, Ilya Maximets 
>> wrote:
>>>
>>> On 18.06.2018 18:07, Ben Pfaff wrote:
 On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote:
>> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote:
>>> On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote:
 On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff  wrote:
>
> Until now, rconn_get_version() has only reported the OpenFlow
>> version in
> use when the rconn is actually connected.  This makes sense, but
>> it has a
> harsh consequence.  Consider code like this:
>
> if (rconn_is_connected(rconn) && rconn_get_version(rconn) >=
>> 0) {
> for (int i = 0; i < 2; i++) {
> struct ofpbuf *b = ofputil_encode_echo_request(
> rconn_get_version(rconn));
> rconn_send(rconn, b, NULL);
> }
> }
>
> Maybe not the smartest code in the world, and probably no one
>> would write
> this exact code in any case, but it doesn't look too risky or
>> crazy.
>
> But it is.  The second trip through the loop can assert-fail inside
> ofputil_encode_echo_request() because rconn_get_version(rconn)
>> returns -1
> instead of a valid OpenFlow version.  That happens if the first
>> call to
> rconn_send() encounters an error while sending the message and
>> therefore
> destroys the underlying vconn and disconnects so that
>> rconn_get_version()
> doesn't have a vconn to query for its version.
>
> In a case like this where all the code to send the messages is
>> close by,
 we
> could just check rconn_get_version() in each loop iteration.  We
>> could
 even
> go through the tree and convince ourselves that individual bits of
>> code
 are
> safe, or be conservative and check rconn_get_version() >= 0 in the
>> iffy
> cases.  But this seems to me like an ongoing source of risk and a
>> way to
> get things wrong in corner cases.
>
> This commit takes a different approach.  It introduces a new
>> invariant: if
> an rconn has ever been connected, then it returns a valid OpenFlow
>> version
> from rconn_get_version().  In addition, if an rconn is currently
 connected,
> then the OpenFlow version it returns is the correct one (that may
>> be
> obvious, but there were corner cases before where it returned -1
>> even
> though rconn_is_connected() returned true).
>
> With this commit, the code above would work OK.  If the first call
>> to
> rconn_send() encounters an error sending the message, then
> rconn_get_version() in the second iteration will return the same
>> value as
> in the first iteration.  The message passed to rconn_send() will
>> end up
> being discarded, but that's much better than either an assertion
>> failure
 or
> having to carefully analyze a lot of our code to deal with one
>> unusual
> corner case.
>
> Reported-by: Han Zhou 
> Signed-off-by: Ben Pfaff 
> ---
>  lib/learning-switch.c |  2 +-
>  lib/rconn.c   | 41
>> -
>  lib/vconn.c   |  1 +
>  3 files changed, 18 insertions(+), 26 deletions(-)
>
 Acked-by: Han Zhou 
>>>
>>> Thanks.  I applied this to master.  I'll backport it to older
>> versions
>>> if no one notices trouble soon.
>>
>> I backported as far as branch-2.5.
>
> Sorry for being late, but I just tried to use branch-2.9 on my test
>> environment
> and found that this patch produces a lot of "connected" logs that are
>> a bit annoying:
>
> 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected
> 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected
> 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected
> 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected
> 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected
> 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected
> 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected
> 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected
> 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected
> 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected
> 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected
> 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected
> 

Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

2018-06-18 Thread Ben Pfaff
On Mon, Jun 18, 2018 at 09:31:43AM -0700, Han Zhou wrote:
> On Mon, Jun 18, 2018 at 8:34 AM, Ilya Maximets 
> wrote:
> >
> > On 18.06.2018 18:07, Ben Pfaff wrote:
> > > On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote:
> > >>> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote:
> >  On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote:
> > > On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff  wrote:
> > >>
> > >> Until now, rconn_get_version() has only reported the OpenFlow
> version in
> > >> use when the rconn is actually connected.  This makes sense, but
> it has a
> > >> harsh consequence.  Consider code like this:
> > >>
> > >> if (rconn_is_connected(rconn) && rconn_get_version(rconn) >=
> 0) {
> > >> for (int i = 0; i < 2; i++) {
> > >> struct ofpbuf *b = ofputil_encode_echo_request(
> > >> rconn_get_version(rconn));
> > >> rconn_send(rconn, b, NULL);
> > >> }
> > >> }
> > >>
> > >> Maybe not the smartest code in the world, and probably no one
> would write
> > >> this exact code in any case, but it doesn't look too risky or
> crazy.
> > >>
> > >> But it is.  The second trip through the loop can assert-fail inside
> > >> ofputil_encode_echo_request() because rconn_get_version(rconn)
> returns -1
> > >> instead of a valid OpenFlow version.  That happens if the first
> call to
> > >> rconn_send() encounters an error while sending the message and
> therefore
> > >> destroys the underlying vconn and disconnects so that
> rconn_get_version()
> > >> doesn't have a vconn to query for its version.
> > >>
> > >> In a case like this where all the code to send the messages is
> close by,
> > > we
> > >> could just check rconn_get_version() in each loop iteration.  We
> could
> > > even
> > >> go through the tree and convince ourselves that individual bits of
> code
> > > are
> > >> safe, or be conservative and check rconn_get_version() >= 0 in the
> iffy
> > >> cases.  But this seems to me like an ongoing source of risk and a
> way to
> > >> get things wrong in corner cases.
> > >>
> > >> This commit takes a different approach.  It introduces a new
> invariant: if
> > >> an rconn has ever been connected, then it returns a valid OpenFlow
> version
> > >> from rconn_get_version().  In addition, if an rconn is currently
> > > connected,
> > >> then the OpenFlow version it returns is the correct one (that may
> be
> > >> obvious, but there were corner cases before where it returned -1
> even
> > >> though rconn_is_connected() returned true).
> > >>
> > >> With this commit, the code above would work OK.  If the first call
> to
> > >> rconn_send() encounters an error sending the message, then
> > >> rconn_get_version() in the second iteration will return the same
> value as
> > >> in the first iteration.  The message passed to rconn_send() will
> end up
> > >> being discarded, but that's much better than either an assertion
> failure
> > > or
> > >> having to carefully analyze a lot of our code to deal with one
> unusual
> > >> corner case.
> > >>
> > >> Reported-by: Han Zhou 
> > >> Signed-off-by: Ben Pfaff 
> > >> ---
> > >>  lib/learning-switch.c |  2 +-
> > >>  lib/rconn.c   | 41
> -
> > >>  lib/vconn.c   |  1 +
> > >>  3 files changed, 18 insertions(+), 26 deletions(-)
> > >>
> > > Acked-by: Han Zhou 
> > 
> >  Thanks.  I applied this to master.  I'll backport it to older
> versions
> >  if no one notices trouble soon.
> > >>>
> > >>> I backported as far as branch-2.5.
> > >>
> > >> Sorry for being late, but I just tried to use branch-2.9 on my test
> environment
> > >> and found that this patch produces a lot of "connected" logs that are
> a bit annoying:
> > >>
> > >> 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected
> > >> 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected
> > >> 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected
> > >> 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected
> > >> 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected
> > >> 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected
> > >> 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected
> > >> 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected
> > >> 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected
> > >> 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected
> > >> 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected
> > >> 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected
> > >> 

Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

2018-06-18 Thread Ben Pfaff
On Mon, Jun 18, 2018 at 06:34:13PM +0300, Ilya Maximets wrote:
> On 18.06.2018 18:07, Ben Pfaff wrote:
> > On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote:
> >>> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote:
>  On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote:
> > On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff  wrote:
> >>
> >> Until now, rconn_get_version() has only reported the OpenFlow version 
> >> in
> >> use when the rconn is actually connected.  This makes sense, but it 
> >> has a
> >> harsh consequence.  Consider code like this:
> >>
> >> if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) {
> >> for (int i = 0; i < 2; i++) {
> >> struct ofpbuf *b = ofputil_encode_echo_request(
> >> rconn_get_version(rconn));
> >> rconn_send(rconn, b, NULL);
> >> }
> >> }
> >>
> >> Maybe not the smartest code in the world, and probably no one would 
> >> write
> >> this exact code in any case, but it doesn't look too risky or crazy.
> >>
> >> But it is.  The second trip through the loop can assert-fail inside
> >> ofputil_encode_echo_request() because rconn_get_version(rconn) returns 
> >> -1
> >> instead of a valid OpenFlow version.  That happens if the first call to
> >> rconn_send() encounters an error while sending the message and 
> >> therefore
> >> destroys the underlying vconn and disconnects so that 
> >> rconn_get_version()
> >> doesn't have a vconn to query for its version.
> >>
> >> In a case like this where all the code to send the messages is close 
> >> by,
> > we
> >> could just check rconn_get_version() in each loop iteration.  We could
> > even
> >> go through the tree and convince ourselves that individual bits of code
> > are
> >> safe, or be conservative and check rconn_get_version() >= 0 in the iffy
> >> cases.  But this seems to me like an ongoing source of risk and a way 
> >> to
> >> get things wrong in corner cases.
> >>
> >> This commit takes a different approach.  It introduces a new 
> >> invariant: if
> >> an rconn has ever been connected, then it returns a valid OpenFlow 
> >> version
> >> from rconn_get_version().  In addition, if an rconn is currently
> > connected,
> >> then the OpenFlow version it returns is the correct one (that may be
> >> obvious, but there were corner cases before where it returned -1 even
> >> though rconn_is_connected() returned true).
> >>
> >> With this commit, the code above would work OK.  If the first call to
> >> rconn_send() encounters an error sending the message, then
> >> rconn_get_version() in the second iteration will return the same value 
> >> as
> >> in the first iteration.  The message passed to rconn_send() will end up
> >> being discarded, but that's much better than either an assertion 
> >> failure
> > or
> >> having to carefully analyze a lot of our code to deal with one unusual
> >> corner case.
> >>
> >> Reported-by: Han Zhou 
> >> Signed-off-by: Ben Pfaff 
> >> ---
> >>  lib/learning-switch.c |  2 +-
> >>  lib/rconn.c   | 41 -
> >>  lib/vconn.c   |  1 +
> >>  3 files changed, 18 insertions(+), 26 deletions(-)
> >>
> > Acked-by: Han Zhou 
> 
>  Thanks.  I applied this to master.  I'll backport it to older versions
>  if no one notices trouble soon.
> >>>
> >>> I backported as far as branch-2.5.
> >>
> >> Sorry for being late, but I just tried to use branch-2.9 on my test 
> >> environment
> >> and found that this patch produces a lot of "connected" logs that are a 
> >> bit annoying:
> >>
> >> 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected
> >> 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected
> >> 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected
> >> 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected
> >> 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected
> >> 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected
> >> 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected
> >> 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected
> >> 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected
> >> 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected
> >> 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected
> >> 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected
> >> 2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected
> >> 2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886: connected
> >> 

Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

2018-06-18 Thread Han Zhou
On Mon, Jun 18, 2018 at 8:34 AM, Ilya Maximets 
wrote:
>
> On 18.06.2018 18:07, Ben Pfaff wrote:
> > On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote:
> >>> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote:
>  On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote:
> > On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff  wrote:
> >>
> >> Until now, rconn_get_version() has only reported the OpenFlow
version in
> >> use when the rconn is actually connected.  This makes sense, but
it has a
> >> harsh consequence.  Consider code like this:
> >>
> >> if (rconn_is_connected(rconn) && rconn_get_version(rconn) >=
0) {
> >> for (int i = 0; i < 2; i++) {
> >> struct ofpbuf *b = ofputil_encode_echo_request(
> >> rconn_get_version(rconn));
> >> rconn_send(rconn, b, NULL);
> >> }
> >> }
> >>
> >> Maybe not the smartest code in the world, and probably no one
would write
> >> this exact code in any case, but it doesn't look too risky or
crazy.
> >>
> >> But it is.  The second trip through the loop can assert-fail inside
> >> ofputil_encode_echo_request() because rconn_get_version(rconn)
returns -1
> >> instead of a valid OpenFlow version.  That happens if the first
call to
> >> rconn_send() encounters an error while sending the message and
therefore
> >> destroys the underlying vconn and disconnects so that
rconn_get_version()
> >> doesn't have a vconn to query for its version.
> >>
> >> In a case like this where all the code to send the messages is
close by,
> > we
> >> could just check rconn_get_version() in each loop iteration.  We
could
> > even
> >> go through the tree and convince ourselves that individual bits of
code
> > are
> >> safe, or be conservative and check rconn_get_version() >= 0 in the
iffy
> >> cases.  But this seems to me like an ongoing source of risk and a
way to
> >> get things wrong in corner cases.
> >>
> >> This commit takes a different approach.  It introduces a new
invariant: if
> >> an rconn has ever been connected, then it returns a valid OpenFlow
version
> >> from rconn_get_version().  In addition, if an rconn is currently
> > connected,
> >> then the OpenFlow version it returns is the correct one (that may
be
> >> obvious, but there were corner cases before where it returned -1
even
> >> though rconn_is_connected() returned true).
> >>
> >> With this commit, the code above would work OK.  If the first call
to
> >> rconn_send() encounters an error sending the message, then
> >> rconn_get_version() in the second iteration will return the same
value as
> >> in the first iteration.  The message passed to rconn_send() will
end up
> >> being discarded, but that's much better than either an assertion
failure
> > or
> >> having to carefully analyze a lot of our code to deal with one
unusual
> >> corner case.
> >>
> >> Reported-by: Han Zhou 
> >> Signed-off-by: Ben Pfaff 
> >> ---
> >>  lib/learning-switch.c |  2 +-
> >>  lib/rconn.c   | 41
-
> >>  lib/vconn.c   |  1 +
> >>  3 files changed, 18 insertions(+), 26 deletions(-)
> >>
> > Acked-by: Han Zhou 
> 
>  Thanks.  I applied this to master.  I'll backport it to older
versions
>  if no one notices trouble soon.
> >>>
> >>> I backported as far as branch-2.5.
> >>
> >> Sorry for being late, but I just tried to use branch-2.9 on my test
environment
> >> and found that this patch produces a lot of "connected" logs that are
a bit annoying:
> >>
> >> 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected
> >> 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected
> >> 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected
> >> 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected
> >> 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected
> >> 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected
> >> 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected
> >> 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected
> >> 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected
> >> 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected
> >> 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected
> >> 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected
> >> 2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected
> >> 2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886: connected
> >> 2018-06-18T12:55:20.611Z|03065|rconn|INFO|br-int<->unix#2887: connected
> >>
> >> On my system it's about 2 log messages on each 2 seconds. It has only
> >> 

Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

2018-06-18 Thread Ilya Maximets
On 18.06.2018 18:07, Ben Pfaff wrote:
> On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote:
>>> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote:
 On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote:
> On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff  wrote:
>>
>> Until now, rconn_get_version() has only reported the OpenFlow version in
>> use when the rconn is actually connected.  This makes sense, but it has a
>> harsh consequence.  Consider code like this:
>>
>> if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) {
>> for (int i = 0; i < 2; i++) {
>> struct ofpbuf *b = ofputil_encode_echo_request(
>> rconn_get_version(rconn));
>> rconn_send(rconn, b, NULL);
>> }
>> }
>>
>> Maybe not the smartest code in the world, and probably no one would write
>> this exact code in any case, but it doesn't look too risky or crazy.
>>
>> But it is.  The second trip through the loop can assert-fail inside
>> ofputil_encode_echo_request() because rconn_get_version(rconn) returns -1
>> instead of a valid OpenFlow version.  That happens if the first call to
>> rconn_send() encounters an error while sending the message and therefore
>> destroys the underlying vconn and disconnects so that rconn_get_version()
>> doesn't have a vconn to query for its version.
>>
>> In a case like this where all the code to send the messages is close by,
> we
>> could just check rconn_get_version() in each loop iteration.  We could
> even
>> go through the tree and convince ourselves that individual bits of code
> are
>> safe, or be conservative and check rconn_get_version() >= 0 in the iffy
>> cases.  But this seems to me like an ongoing source of risk and a way to
>> get things wrong in corner cases.
>>
>> This commit takes a different approach.  It introduces a new invariant: 
>> if
>> an rconn has ever been connected, then it returns a valid OpenFlow 
>> version
>> from rconn_get_version().  In addition, if an rconn is currently
> connected,
>> then the OpenFlow version it returns is the correct one (that may be
>> obvious, but there were corner cases before where it returned -1 even
>> though rconn_is_connected() returned true).
>>
>> With this commit, the code above would work OK.  If the first call to
>> rconn_send() encounters an error sending the message, then
>> rconn_get_version() in the second iteration will return the same value as
>> in the first iteration.  The message passed to rconn_send() will end up
>> being discarded, but that's much better than either an assertion failure
> or
>> having to carefully analyze a lot of our code to deal with one unusual
>> corner case.
>>
>> Reported-by: Han Zhou 
>> Signed-off-by: Ben Pfaff 
>> ---
>>  lib/learning-switch.c |  2 +-
>>  lib/rconn.c   | 41 -
>>  lib/vconn.c   |  1 +
>>  3 files changed, 18 insertions(+), 26 deletions(-)
>>
> Acked-by: Han Zhou 

 Thanks.  I applied this to master.  I'll backport it to older versions
 if no one notices trouble soon.
>>>
>>> I backported as far as branch-2.5.
>>
>> Sorry for being late, but I just tried to use branch-2.9 on my test 
>> environment
>> and found that this patch produces a lot of "connected" logs that are a bit 
>> annoying:
>>
>> 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected
>> 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected
>> 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected
>> 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected
>> 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected
>> 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected
>> 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected
>> 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected
>> 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected
>> 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected
>> 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected
>> 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected
>> 2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected
>> 2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886: connected
>> 2018-06-18T12:55:20.611Z|03065|rconn|INFO|br-int<->unix#2887: connected
>>
>> On my system it's about 2 log messages on each 2 seconds. It has only
>> ovsdb-server, ovs-vswitchd and ovsdb-client monitor running.
>>
>> Could you please check?
> 
> Are you sure that it's this patch that makes a difference?  It doesn't
> change anything in the logging.

I don't know much 

Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

2018-06-18 Thread Ben Pfaff
On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote:
> > On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote:
> >> On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote:
> >> > On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff  wrote:
> >> > >
> >> > > Until now, rconn_get_version() has only reported the OpenFlow version 
> >> > > in
> >> > > use when the rconn is actually connected.  This makes sense, but it 
> >> > > has a
> >> > > harsh consequence.  Consider code like this:
> >> > >
> >> > > if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) {
> >> > > for (int i = 0; i < 2; i++) {
> >> > > struct ofpbuf *b = ofputil_encode_echo_request(
> >> > > rconn_get_version(rconn));
> >> > > rconn_send(rconn, b, NULL);
> >> > > }
> >> > > }
> >> > >
> >> > > Maybe not the smartest code in the world, and probably no one would 
> >> > > write
> >> > > this exact code in any case, but it doesn't look too risky or crazy.
> >> > >
> >> > > But it is.  The second trip through the loop can assert-fail inside
> >> > > ofputil_encode_echo_request() because rconn_get_version(rconn) returns 
> >> > > -1
> >> > > instead of a valid OpenFlow version.  That happens if the first call to
> >> > > rconn_send() encounters an error while sending the message and 
> >> > > therefore
> >> > > destroys the underlying vconn and disconnects so that 
> >> > > rconn_get_version()
> >> > > doesn't have a vconn to query for its version.
> >> > >
> >> > > In a case like this where all the code to send the messages is close 
> >> > > by,
> >> > we
> >> > > could just check rconn_get_version() in each loop iteration.  We could
> >> > even
> >> > > go through the tree and convince ourselves that individual bits of code
> >> > are
> >> > > safe, or be conservative and check rconn_get_version() >= 0 in the iffy
> >> > > cases.  But this seems to me like an ongoing source of risk and a way 
> >> > > to
> >> > > get things wrong in corner cases.
> >> > >
> >> > > This commit takes a different approach.  It introduces a new 
> >> > > invariant: if
> >> > > an rconn has ever been connected, then it returns a valid OpenFlow 
> >> > > version
> >> > > from rconn_get_version().  In addition, if an rconn is currently
> >> > connected,
> >> > > then the OpenFlow version it returns is the correct one (that may be
> >> > > obvious, but there were corner cases before where it returned -1 even
> >> > > though rconn_is_connected() returned true).
> >> > >
> >> > > With this commit, the code above would work OK.  If the first call to
> >> > > rconn_send() encounters an error sending the message, then
> >> > > rconn_get_version() in the second iteration will return the same value 
> >> > > as
> >> > > in the first iteration.  The message passed to rconn_send() will end up
> >> > > being discarded, but that's much better than either an assertion 
> >> > > failure
> >> > or
> >> > > having to carefully analyze a lot of our code to deal with one unusual
> >> > > corner case.
> >> > >
> >> > > Reported-by: Han Zhou 
> >> > > Signed-off-by: Ben Pfaff 
> >> > > ---
> >> > >  lib/learning-switch.c |  2 +-
> >> > >  lib/rconn.c   | 41 -
> >> > >  lib/vconn.c   |  1 +
> >> > >  3 files changed, 18 insertions(+), 26 deletions(-)
> >> > >
> >> > Acked-by: Han Zhou 
> >> 
> >> Thanks.  I applied this to master.  I'll backport it to older versions
> >> if no one notices trouble soon.
> > 
> > I backported as far as branch-2.5.
> 
> Sorry for being late, but I just tried to use branch-2.9 on my test 
> environment
> and found that this patch produces a lot of "connected" logs that are a bit 
> annoying:
> 
> 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected
> 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected
> 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected
> 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected
> 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected
> 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected
> 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected
> 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected
> 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected
> 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected
> 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected
> 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected
> 2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected
> 2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886: connected
> 2018-06-18T12:55:20.611Z|03065|rconn|INFO|br-int<->unix#2887: connected
> 
> On my system it's about 2 log messages on each 2 seconds. It has only
> ovsdb-server, ovs-vswitchd and ovsdb-client monitor running.
> 

Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

2018-06-18 Thread Ilya Maximets
> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote:
>> On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote:
>> > On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff  wrote:
>> > >
>> > > Until now, rconn_get_version() has only reported the OpenFlow version in
>> > > use when the rconn is actually connected.  This makes sense, but it has a
>> > > harsh consequence.  Consider code like this:
>> > >
>> > > if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) {
>> > > for (int i = 0; i < 2; i++) {
>> > > struct ofpbuf *b = ofputil_encode_echo_request(
>> > > rconn_get_version(rconn));
>> > > rconn_send(rconn, b, NULL);
>> > > }
>> > > }
>> > >
>> > > Maybe not the smartest code in the world, and probably no one would write
>> > > this exact code in any case, but it doesn't look too risky or crazy.
>> > >
>> > > But it is.  The second trip through the loop can assert-fail inside
>> > > ofputil_encode_echo_request() because rconn_get_version(rconn) returns -1
>> > > instead of a valid OpenFlow version.  That happens if the first call to
>> > > rconn_send() encounters an error while sending the message and therefore
>> > > destroys the underlying vconn and disconnects so that rconn_get_version()
>> > > doesn't have a vconn to query for its version.
>> > >
>> > > In a case like this where all the code to send the messages is close by,
>> > we
>> > > could just check rconn_get_version() in each loop iteration.  We could
>> > even
>> > > go through the tree and convince ourselves that individual bits of code
>> > are
>> > > safe, or be conservative and check rconn_get_version() >= 0 in the iffy
>> > > cases.  But this seems to me like an ongoing source of risk and a way to
>> > > get things wrong in corner cases.
>> > >
>> > > This commit takes a different approach.  It introduces a new invariant: 
>> > > if
>> > > an rconn has ever been connected, then it returns a valid OpenFlow 
>> > > version
>> > > from rconn_get_version().  In addition, if an rconn is currently
>> > connected,
>> > > then the OpenFlow version it returns is the correct one (that may be
>> > > obvious, but there were corner cases before where it returned -1 even
>> > > though rconn_is_connected() returned true).
>> > >
>> > > With this commit, the code above would work OK.  If the first call to
>> > > rconn_send() encounters an error sending the message, then
>> > > rconn_get_version() in the second iteration will return the same value as
>> > > in the first iteration.  The message passed to rconn_send() will end up
>> > > being discarded, but that's much better than either an assertion failure
>> > or
>> > > having to carefully analyze a lot of our code to deal with one unusual
>> > > corner case.
>> > >
>> > > Reported-by: Han Zhou 
>> > > Signed-off-by: Ben Pfaff 
>> > > ---
>> > >  lib/learning-switch.c |  2 +-
>> > >  lib/rconn.c   | 41 -
>> > >  lib/vconn.c   |  1 +
>> > >  3 files changed, 18 insertions(+), 26 deletions(-)
>> > >
>> > Acked-by: Han Zhou 
>> 
>> Thanks.  I applied this to master.  I'll backport it to older versions
>> if no one notices trouble soon.
> 
> I backported as far as branch-2.5.

Sorry for being late, but I just tried to use branch-2.9 on my test environment
and found that this patch produces a lot of "connected" logs that are a bit 
annoying:

2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected
2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected
2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected
2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected
2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected
2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected
2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected
2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected
2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected
2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected
2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected
2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected
2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected
2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886: connected
2018-06-18T12:55:20.611Z|03065|rconn|INFO|br-int<->unix#2887: connected

On my system it's about 2 log messages on each 2 seconds. It has only
ovsdb-server, ovs-vswitchd and ovsdb-client monitor running.

Could you please check?

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

2018-06-15 Thread Ben Pfaff
On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote:
> On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote:
> > On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff  wrote:
> > >
> > > Until now, rconn_get_version() has only reported the OpenFlow version in
> > > use when the rconn is actually connected.  This makes sense, but it has a
> > > harsh consequence.  Consider code like this:
> > >
> > > if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) {
> > > for (int i = 0; i < 2; i++) {
> > > struct ofpbuf *b = ofputil_encode_echo_request(
> > > rconn_get_version(rconn));
> > > rconn_send(rconn, b, NULL);
> > > }
> > > }
> > >
> > > Maybe not the smartest code in the world, and probably no one would write
> > > this exact code in any case, but it doesn't look too risky or crazy.
> > >
> > > But it is.  The second trip through the loop can assert-fail inside
> > > ofputil_encode_echo_request() because rconn_get_version(rconn) returns -1
> > > instead of a valid OpenFlow version.  That happens if the first call to
> > > rconn_send() encounters an error while sending the message and therefore
> > > destroys the underlying vconn and disconnects so that rconn_get_version()
> > > doesn't have a vconn to query for its version.
> > >
> > > In a case like this where all the code to send the messages is close by,
> > we
> > > could just check rconn_get_version() in each loop iteration.  We could
> > even
> > > go through the tree and convince ourselves that individual bits of code
> > are
> > > safe, or be conservative and check rconn_get_version() >= 0 in the iffy
> > > cases.  But this seems to me like an ongoing source of risk and a way to
> > > get things wrong in corner cases.
> > >
> > > This commit takes a different approach.  It introduces a new invariant: if
> > > an rconn has ever been connected, then it returns a valid OpenFlow version
> > > from rconn_get_version().  In addition, if an rconn is currently
> > connected,
> > > then the OpenFlow version it returns is the correct one (that may be
> > > obvious, but there were corner cases before where it returned -1 even
> > > though rconn_is_connected() returned true).
> > >
> > > With this commit, the code above would work OK.  If the first call to
> > > rconn_send() encounters an error sending the message, then
> > > rconn_get_version() in the second iteration will return the same value as
> > > in the first iteration.  The message passed to rconn_send() will end up
> > > being discarded, but that's much better than either an assertion failure
> > or
> > > having to carefully analyze a lot of our code to deal with one unusual
> > > corner case.
> > >
> > > Reported-by: Han Zhou 
> > > Signed-off-by: Ben Pfaff 
> > > ---
> > >  lib/learning-switch.c |  2 +-
> > >  lib/rconn.c   | 41 -
> > >  lib/vconn.c   |  1 +
> > >  3 files changed, 18 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/lib/learning-switch.c b/lib/learning-switch.c
> > > index 04eaa6e8e73f..8102475cae52 100644
> > > --- a/lib/learning-switch.c
> > > +++ b/lib/learning-switch.c
> > > @@ -305,7 +305,7 @@ lswitch_run(struct lswitch *sw)
> > >  rconn_run(sw->rconn);
> > >
> > >  if (sw->state == S_CONNECTING) {
> > > -if (rconn_get_version(sw->rconn) != -1) {
> > > +if (rconn_is_connected(sw->rconn)) {
> > >  lswitch_handshake(sw);
> > >  sw->state = S_FEATURES_REPLY;
> > >  }
> > > diff --git a/lib/rconn.c b/lib/rconn.c
> > > index 3cad259d0178..a3f811fedbe3 100644
> > > --- a/lib/rconn.c
> > > +++ b/lib/rconn.c
> > > @@ -141,7 +141,8 @@ struct rconn {
> > >  struct vconn *monitors[MAXIMUM_MONITORS];
> > >  size_t n_monitors;
> > >
> > > -uint32_t allowed_versions;
> > > +uint32_t allowed_versions;  /* Acceptable OpenFlow versions. */
> > > +int version;/* Current or most recent version. */
> > >  };
> > >
> > >  /* Counts packets and bytes queued into an rconn by a given source. */
> > > @@ -182,8 +183,6 @@ static bool is_connected_state(enum state);
> > >  static bool is_admitted_msg(const struct ofpbuf *);
> > >  static bool rconn_logging_connection_attempts__(const struct rconn *rc)
> > >  OVS_REQUIRES(rc->mutex);
> > > -static int rconn_get_version__(const struct rconn *rconn)
> > > -OVS_REQUIRES(rconn->mutex);
> > >
> > >  /* The following prototypes duplicate those in rconn.h, but there we
> > weren't
> > >   * able to add the OVS_EXCLUDED annotations because the definition of
> > struct
> > > @@ -284,7 +283,9 @@ rconn_create(int probe_interval, int max_backoff,
> > uint8_t dscp,
> > >  rconn_set_dscp(rc, dscp);
> > >
> > >  rc->n_monitors = 0;
> > > +
> > >  rc->allowed_versions = allowed_versions;
> > > +rc->version = -1;
> > >
> > >  return rc;
> > >  }
> > > @@ -376,8 +377,7 @@ rconn_connect_unreliably(struct rconn *rc,
> > >  

Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

2018-05-23 Thread Ben Pfaff
On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote:
> On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff  wrote:
> >
> > Until now, rconn_get_version() has only reported the OpenFlow version in
> > use when the rconn is actually connected.  This makes sense, but it has a
> > harsh consequence.  Consider code like this:
> >
> > if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) {
> > for (int i = 0; i < 2; i++) {
> > struct ofpbuf *b = ofputil_encode_echo_request(
> > rconn_get_version(rconn));
> > rconn_send(rconn, b, NULL);
> > }
> > }
> >
> > Maybe not the smartest code in the world, and probably no one would write
> > this exact code in any case, but it doesn't look too risky or crazy.
> >
> > But it is.  The second trip through the loop can assert-fail inside
> > ofputil_encode_echo_request() because rconn_get_version(rconn) returns -1
> > instead of a valid OpenFlow version.  That happens if the first call to
> > rconn_send() encounters an error while sending the message and therefore
> > destroys the underlying vconn and disconnects so that rconn_get_version()
> > doesn't have a vconn to query for its version.
> >
> > In a case like this where all the code to send the messages is close by,
> we
> > could just check rconn_get_version() in each loop iteration.  We could
> even
> > go through the tree and convince ourselves that individual bits of code
> are
> > safe, or be conservative and check rconn_get_version() >= 0 in the iffy
> > cases.  But this seems to me like an ongoing source of risk and a way to
> > get things wrong in corner cases.
> >
> > This commit takes a different approach.  It introduces a new invariant: if
> > an rconn has ever been connected, then it returns a valid OpenFlow version
> > from rconn_get_version().  In addition, if an rconn is currently
> connected,
> > then the OpenFlow version it returns is the correct one (that may be
> > obvious, but there were corner cases before where it returned -1 even
> > though rconn_is_connected() returned true).
> >
> > With this commit, the code above would work OK.  If the first call to
> > rconn_send() encounters an error sending the message, then
> > rconn_get_version() in the second iteration will return the same value as
> > in the first iteration.  The message passed to rconn_send() will end up
> > being discarded, but that's much better than either an assertion failure
> or
> > having to carefully analyze a lot of our code to deal with one unusual
> > corner case.
> >
> > Reported-by: Han Zhou 
> > Signed-off-by: Ben Pfaff 
> > ---
> >  lib/learning-switch.c |  2 +-
> >  lib/rconn.c   | 41 -
> >  lib/vconn.c   |  1 +
> >  3 files changed, 18 insertions(+), 26 deletions(-)
> >
> > diff --git a/lib/learning-switch.c b/lib/learning-switch.c
> > index 04eaa6e8e73f..8102475cae52 100644
> > --- a/lib/learning-switch.c
> > +++ b/lib/learning-switch.c
> > @@ -305,7 +305,7 @@ lswitch_run(struct lswitch *sw)
> >  rconn_run(sw->rconn);
> >
> >  if (sw->state == S_CONNECTING) {
> > -if (rconn_get_version(sw->rconn) != -1) {
> > +if (rconn_is_connected(sw->rconn)) {
> >  lswitch_handshake(sw);
> >  sw->state = S_FEATURES_REPLY;
> >  }
> > diff --git a/lib/rconn.c b/lib/rconn.c
> > index 3cad259d0178..a3f811fedbe3 100644
> > --- a/lib/rconn.c
> > +++ b/lib/rconn.c
> > @@ -141,7 +141,8 @@ struct rconn {
> >  struct vconn *monitors[MAXIMUM_MONITORS];
> >  size_t n_monitors;
> >
> > -uint32_t allowed_versions;
> > +uint32_t allowed_versions;  /* Acceptable OpenFlow versions. */
> > +int version;/* Current or most recent version. */
> >  };
> >
> >  /* Counts packets and bytes queued into an rconn by a given source. */
> > @@ -182,8 +183,6 @@ static bool is_connected_state(enum state);
> >  static bool is_admitted_msg(const struct ofpbuf *);
> >  static bool rconn_logging_connection_attempts__(const struct rconn *rc)
> >  OVS_REQUIRES(rc->mutex);
> > -static int rconn_get_version__(const struct rconn *rconn)
> > -OVS_REQUIRES(rconn->mutex);
> >
> >  /* The following prototypes duplicate those in rconn.h, but there we
> weren't
> >   * able to add the OVS_EXCLUDED annotations because the definition of
> struct
> > @@ -284,7 +283,9 @@ rconn_create(int probe_interval, int max_backoff,
> uint8_t dscp,
> >  rconn_set_dscp(rc, dscp);
> >
> >  rc->n_monitors = 0;
> > +
> >  rc->allowed_versions = allowed_versions;
> > +rc->version = -1;
> >
> >  return rc;
> >  }
> > @@ -376,8 +377,7 @@ rconn_connect_unreliably(struct rconn *rc,
> >  rconn_set_target__(rc, vconn_get_name(vconn), name);
> >  rc->reliable = false;
> >  rc->vconn = vconn;
> > -rc->last_connected = time_now();
> > -state_transition(rc, S_ACTIVE);
> > +state_transition(rc, 

Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

2018-05-23 Thread Han Zhou
On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff  wrote:
>
> Until now, rconn_get_version() has only reported the OpenFlow version in
> use when the rconn is actually connected.  This makes sense, but it has a
> harsh consequence.  Consider code like this:
>
> if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) {
> for (int i = 0; i < 2; i++) {
> struct ofpbuf *b = ofputil_encode_echo_request(
> rconn_get_version(rconn));
> rconn_send(rconn, b, NULL);
> }
> }
>
> Maybe not the smartest code in the world, and probably no one would write
> this exact code in any case, but it doesn't look too risky or crazy.
>
> But it is.  The second trip through the loop can assert-fail inside
> ofputil_encode_echo_request() because rconn_get_version(rconn) returns -1
> instead of a valid OpenFlow version.  That happens if the first call to
> rconn_send() encounters an error while sending the message and therefore
> destroys the underlying vconn and disconnects so that rconn_get_version()
> doesn't have a vconn to query for its version.
>
> In a case like this where all the code to send the messages is close by,
we
> could just check rconn_get_version() in each loop iteration.  We could
even
> go through the tree and convince ourselves that individual bits of code
are
> safe, or be conservative and check rconn_get_version() >= 0 in the iffy
> cases.  But this seems to me like an ongoing source of risk and a way to
> get things wrong in corner cases.
>
> This commit takes a different approach.  It introduces a new invariant: if
> an rconn has ever been connected, then it returns a valid OpenFlow version
> from rconn_get_version().  In addition, if an rconn is currently
connected,
> then the OpenFlow version it returns is the correct one (that may be
> obvious, but there were corner cases before where it returned -1 even
> though rconn_is_connected() returned true).
>
> With this commit, the code above would work OK.  If the first call to
> rconn_send() encounters an error sending the message, then
> rconn_get_version() in the second iteration will return the same value as
> in the first iteration.  The message passed to rconn_send() will end up
> being discarded, but that's much better than either an assertion failure
or
> having to carefully analyze a lot of our code to deal with one unusual
> corner case.
>
> Reported-by: Han Zhou 
> Signed-off-by: Ben Pfaff 
> ---
>  lib/learning-switch.c |  2 +-
>  lib/rconn.c   | 41 -
>  lib/vconn.c   |  1 +
>  3 files changed, 18 insertions(+), 26 deletions(-)
>
> diff --git a/lib/learning-switch.c b/lib/learning-switch.c
> index 04eaa6e8e73f..8102475cae52 100644
> --- a/lib/learning-switch.c
> +++ b/lib/learning-switch.c
> @@ -305,7 +305,7 @@ lswitch_run(struct lswitch *sw)
>  rconn_run(sw->rconn);
>
>  if (sw->state == S_CONNECTING) {
> -if (rconn_get_version(sw->rconn) != -1) {
> +if (rconn_is_connected(sw->rconn)) {
>  lswitch_handshake(sw);
>  sw->state = S_FEATURES_REPLY;
>  }
> diff --git a/lib/rconn.c b/lib/rconn.c
> index 3cad259d0178..a3f811fedbe3 100644
> --- a/lib/rconn.c
> +++ b/lib/rconn.c
> @@ -141,7 +141,8 @@ struct rconn {
>  struct vconn *monitors[MAXIMUM_MONITORS];
>  size_t n_monitors;
>
> -uint32_t allowed_versions;
> +uint32_t allowed_versions;  /* Acceptable OpenFlow versions. */
> +int version;/* Current or most recent version. */
>  };
>
>  /* Counts packets and bytes queued into an rconn by a given source. */
> @@ -182,8 +183,6 @@ static bool is_connected_state(enum state);
>  static bool is_admitted_msg(const struct ofpbuf *);
>  static bool rconn_logging_connection_attempts__(const struct rconn *rc)
>  OVS_REQUIRES(rc->mutex);
> -static int rconn_get_version__(const struct rconn *rconn)
> -OVS_REQUIRES(rconn->mutex);
>
>  /* The following prototypes duplicate those in rconn.h, but there we
weren't
>   * able to add the OVS_EXCLUDED annotations because the definition of
struct
> @@ -284,7 +283,9 @@ rconn_create(int probe_interval, int max_backoff,
uint8_t dscp,
>  rconn_set_dscp(rc, dscp);
>
>  rc->n_monitors = 0;
> +
>  rc->allowed_versions = allowed_versions;
> +rc->version = -1;
>
>  return rc;
>  }
> @@ -376,8 +377,7 @@ rconn_connect_unreliably(struct rconn *rc,
>  rconn_set_target__(rc, vconn_get_name(vconn), name);
>  rc->reliable = false;
>  rc->vconn = vconn;
> -rc->last_connected = time_now();
> -state_transition(rc, S_ACTIVE);
> +state_transition(rc, S_CONNECTING);
>  ovs_mutex_unlock(>mutex);
>  }
>
> @@ -514,6 +514,7 @@ run_CONNECTING(struct rconn *rc)
>  VLOG_INFO("%s: connected", rc->name);
>  rc->n_successful_connections++;
>  state_transition(rc, S_ACTIVE);
> +rc->version = 

[ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

2018-05-23 Thread Ben Pfaff
Until now, rconn_get_version() has only reported the OpenFlow version in
use when the rconn is actually connected.  This makes sense, but it has a
harsh consequence.  Consider code like this:

if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) {
for (int i = 0; i < 2; i++) {
struct ofpbuf *b = ofputil_encode_echo_request(
rconn_get_version(rconn));
rconn_send(rconn, b, NULL);
}
}

Maybe not the smartest code in the world, and probably no one would write
this exact code in any case, but it doesn't look too risky or crazy.

But it is.  The second trip through the loop can assert-fail inside
ofputil_encode_echo_request() because rconn_get_version(rconn) returns -1
instead of a valid OpenFlow version.  That happens if the first call to
rconn_send() encounters an error while sending the message and therefore
destroys the underlying vconn and disconnects so that rconn_get_version()
doesn't have a vconn to query for its version.

In a case like this where all the code to send the messages is close by, we
could just check rconn_get_version() in each loop iteration.  We could even
go through the tree and convince ourselves that individual bits of code are
safe, or be conservative and check rconn_get_version() >= 0 in the iffy
cases.  But this seems to me like an ongoing source of risk and a way to
get things wrong in corner cases.

This commit takes a different approach.  It introduces a new invariant: if
an rconn has ever been connected, then it returns a valid OpenFlow version
from rconn_get_version().  In addition, if an rconn is currently connected,
then the OpenFlow version it returns is the correct one (that may be
obvious, but there were corner cases before where it returned -1 even
though rconn_is_connected() returned true).

With this commit, the code above would work OK.  If the first call to
rconn_send() encounters an error sending the message, then
rconn_get_version() in the second iteration will return the same value as
in the first iteration.  The message passed to rconn_send() will end up
being discarded, but that's much better than either an assertion failure or
having to carefully analyze a lot of our code to deal with one unusual
corner case.

Reported-by: Han Zhou 
Signed-off-by: Ben Pfaff 
---
 lib/learning-switch.c |  2 +-
 lib/rconn.c   | 41 -
 lib/vconn.c   |  1 +
 3 files changed, 18 insertions(+), 26 deletions(-)

diff --git a/lib/learning-switch.c b/lib/learning-switch.c
index 04eaa6e8e73f..8102475cae52 100644
--- a/lib/learning-switch.c
+++ b/lib/learning-switch.c
@@ -305,7 +305,7 @@ lswitch_run(struct lswitch *sw)
 rconn_run(sw->rconn);
 
 if (sw->state == S_CONNECTING) {
-if (rconn_get_version(sw->rconn) != -1) {
+if (rconn_is_connected(sw->rconn)) {
 lswitch_handshake(sw);
 sw->state = S_FEATURES_REPLY;
 }
diff --git a/lib/rconn.c b/lib/rconn.c
index 3cad259d0178..a3f811fedbe3 100644
--- a/lib/rconn.c
+++ b/lib/rconn.c
@@ -141,7 +141,8 @@ struct rconn {
 struct vconn *monitors[MAXIMUM_MONITORS];
 size_t n_monitors;
 
-uint32_t allowed_versions;
+uint32_t allowed_versions;  /* Acceptable OpenFlow versions. */
+int version;/* Current or most recent version. */
 };
 
 /* Counts packets and bytes queued into an rconn by a given source. */
@@ -182,8 +183,6 @@ static bool is_connected_state(enum state);
 static bool is_admitted_msg(const struct ofpbuf *);
 static bool rconn_logging_connection_attempts__(const struct rconn *rc)
 OVS_REQUIRES(rc->mutex);
-static int rconn_get_version__(const struct rconn *rconn)
-OVS_REQUIRES(rconn->mutex);
 
 /* The following prototypes duplicate those in rconn.h, but there we weren't
  * able to add the OVS_EXCLUDED annotations because the definition of struct
@@ -284,7 +283,9 @@ rconn_create(int probe_interval, int max_backoff, uint8_t 
dscp,
 rconn_set_dscp(rc, dscp);
 
 rc->n_monitors = 0;
+
 rc->allowed_versions = allowed_versions;
+rc->version = -1;
 
 return rc;
 }
@@ -376,8 +377,7 @@ rconn_connect_unreliably(struct rconn *rc,
 rconn_set_target__(rc, vconn_get_name(vconn), name);
 rc->reliable = false;
 rc->vconn = vconn;
-rc->last_connected = time_now();
-state_transition(rc, S_ACTIVE);
+state_transition(rc, S_CONNECTING);
 ovs_mutex_unlock(>mutex);
 }
 
@@ -514,6 +514,7 @@ run_CONNECTING(struct rconn *rc)
 VLOG_INFO("%s: connected", rc->name);
 rc->n_successful_connections++;
 state_transition(rc, S_ACTIVE);
+rc->version = vconn_get_version(rc->vconn);
 rc->last_connected = rc->state_entered;
 } else if (retval != EAGAIN) {
 if (rconn_logging_connection_attempts__(rc)) {
@@ -575,14 +576,8 @@ run_ACTIVE(struct rconn *rc)
  * can end up queuing a packet with vconn == NULL and then