Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.
On 19.06.2018 19:03, Ben Pfaff wrote: > On Tue, Jun 19, 2018 at 10:58:36AM +0300, Ilya Maximets wrote: >> On 18.06.2018 22:24, Ben Pfaff wrote: >>> On Mon, Jun 18, 2018 at 06:34:13PM +0300, Ilya Maximets wrote: On 18.06.2018 18:07, Ben Pfaff wrote: > On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote: >>> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote: On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote: > On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff wrote: >> >> Until now, rconn_get_version() has only reported the OpenFlow >> version in >> use when the rconn is actually connected. This makes sense, but it >> has a >> harsh consequence. Consider code like this: >> >> if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) { >> for (int i = 0; i < 2; i++) { >> struct ofpbuf *b = ofputil_encode_echo_request( >> rconn_get_version(rconn)); >> rconn_send(rconn, b, NULL); >> } >> } >> >> Maybe not the smartest code in the world, and probably no one would >> write >> this exact code in any case, but it doesn't look too risky or crazy. >> >> But it is. The second trip through the loop can assert-fail inside >> ofputil_encode_echo_request() because rconn_get_version(rconn) >> returns -1 >> instead of a valid OpenFlow version. That happens if the first call >> to >> rconn_send() encounters an error while sending the message and >> therefore >> destroys the underlying vconn and disconnects so that >> rconn_get_version() >> doesn't have a vconn to query for its version. >> >> In a case like this where all the code to send the messages is close >> by, > we >> could just check rconn_get_version() in each loop iteration. We >> could > even >> go through the tree and convince ourselves that individual bits of >> code > are >> safe, or be conservative and check rconn_get_version() >= 0 in the >> iffy >> cases. But this seems to me like an ongoing source of risk and a >> way to >> get things wrong in corner cases. >> >> This commit takes a different approach. It introduces a new >> invariant: if >> an rconn has ever been connected, then it returns a valid OpenFlow >> version >> from rconn_get_version(). In addition, if an rconn is currently > connected, >> then the OpenFlow version it returns is the correct one (that may be >> obvious, but there were corner cases before where it returned -1 even >> though rconn_is_connected() returned true). >> >> With this commit, the code above would work OK. If the first call to >> rconn_send() encounters an error sending the message, then >> rconn_get_version() in the second iteration will return the same >> value as >> in the first iteration. The message passed to rconn_send() will end >> up >> being discarded, but that's much better than either an assertion >> failure > or >> having to carefully analyze a lot of our code to deal with one >> unusual >> corner case. >> >> Reported-by: Han Zhou >> Signed-off-by: Ben Pfaff >> --- >> lib/learning-switch.c | 2 +- >> lib/rconn.c | 41 - >> lib/vconn.c | 1 + >> 3 files changed, 18 insertions(+), 26 deletions(-) >> > Acked-by: Han Zhou Thanks. I applied this to master. I'll backport it to older versions if no one notices trouble soon. >>> >>> I backported as far as branch-2.5. >> >> Sorry for being late, but I just tried to use branch-2.9 on my test >> environment >> and found that this patch produces a lot of "connected" logs that are a >> bit annoying: >> >> 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected >> 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected >> 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected >> 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected >> 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected >> 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected >> 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected >> 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected >> 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected
Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.
On Tue, Jun 19, 2018 at 10:58:36AM +0300, Ilya Maximets wrote: > On 18.06.2018 22:24, Ben Pfaff wrote: > > On Mon, Jun 18, 2018 at 06:34:13PM +0300, Ilya Maximets wrote: > >> On 18.06.2018 18:07, Ben Pfaff wrote: > >>> On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote: > > On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote: > >> On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote: > >>> On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff wrote: > > Until now, rconn_get_version() has only reported the OpenFlow > version in > use when the rconn is actually connected. This makes sense, but it > has a > harsh consequence. Consider code like this: > > if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) { > for (int i = 0; i < 2; i++) { > struct ofpbuf *b = ofputil_encode_echo_request( > rconn_get_version(rconn)); > rconn_send(rconn, b, NULL); > } > } > > Maybe not the smartest code in the world, and probably no one would > write > this exact code in any case, but it doesn't look too risky or crazy. > > But it is. The second trip through the loop can assert-fail inside > ofputil_encode_echo_request() because rconn_get_version(rconn) > returns -1 > instead of a valid OpenFlow version. That happens if the first call > to > rconn_send() encounters an error while sending the message and > therefore > destroys the underlying vconn and disconnects so that > rconn_get_version() > doesn't have a vconn to query for its version. > > In a case like this where all the code to send the messages is close > by, > >>> we > could just check rconn_get_version() in each loop iteration. We > could > >>> even > go through the tree and convince ourselves that individual bits of > code > >>> are > safe, or be conservative and check rconn_get_version() >= 0 in the > iffy > cases. But this seems to me like an ongoing source of risk and a > way to > get things wrong in corner cases. > > This commit takes a different approach. It introduces a new > invariant: if > an rconn has ever been connected, then it returns a valid OpenFlow > version > from rconn_get_version(). In addition, if an rconn is currently > >>> connected, > then the OpenFlow version it returns is the correct one (that may be > obvious, but there were corner cases before where it returned -1 even > though rconn_is_connected() returned true). > > With this commit, the code above would work OK. If the first call to > rconn_send() encounters an error sending the message, then > rconn_get_version() in the second iteration will return the same > value as > in the first iteration. The message passed to rconn_send() will end > up > being discarded, but that's much better than either an assertion > failure > >>> or > having to carefully analyze a lot of our code to deal with one > unusual > corner case. > > Reported-by: Han Zhou > Signed-off-by: Ben Pfaff > --- > lib/learning-switch.c | 2 +- > lib/rconn.c | 41 - > lib/vconn.c | 1 + > 3 files changed, 18 insertions(+), 26 deletions(-) > > >>> Acked-by: Han Zhou > >> > >> Thanks. I applied this to master. I'll backport it to older versions > >> if no one notices trouble soon. > > > > I backported as far as branch-2.5. > > Sorry for being late, but I just tried to use branch-2.9 on my test > environment > and found that this patch produces a lot of "connected" logs that are a > bit annoying: > > 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected > 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected > 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected > 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected > 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected > 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected > 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected > 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected > 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected >
Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.
On 18.06.2018 22:24, Ben Pfaff wrote: > On Mon, Jun 18, 2018 at 06:34:13PM +0300, Ilya Maximets wrote: >> On 18.06.2018 18:07, Ben Pfaff wrote: >>> On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote: > On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote: >> On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote: >>> On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff wrote: Until now, rconn_get_version() has only reported the OpenFlow version in use when the rconn is actually connected. This makes sense, but it has a harsh consequence. Consider code like this: if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) { for (int i = 0; i < 2; i++) { struct ofpbuf *b = ofputil_encode_echo_request( rconn_get_version(rconn)); rconn_send(rconn, b, NULL); } } Maybe not the smartest code in the world, and probably no one would write this exact code in any case, but it doesn't look too risky or crazy. But it is. The second trip through the loop can assert-fail inside ofputil_encode_echo_request() because rconn_get_version(rconn) returns -1 instead of a valid OpenFlow version. That happens if the first call to rconn_send() encounters an error while sending the message and therefore destroys the underlying vconn and disconnects so that rconn_get_version() doesn't have a vconn to query for its version. In a case like this where all the code to send the messages is close by, >>> we could just check rconn_get_version() in each loop iteration. We could >>> even go through the tree and convince ourselves that individual bits of code >>> are safe, or be conservative and check rconn_get_version() >= 0 in the iffy cases. But this seems to me like an ongoing source of risk and a way to get things wrong in corner cases. This commit takes a different approach. It introduces a new invariant: if an rconn has ever been connected, then it returns a valid OpenFlow version from rconn_get_version(). In addition, if an rconn is currently >>> connected, then the OpenFlow version it returns is the correct one (that may be obvious, but there were corner cases before where it returned -1 even though rconn_is_connected() returned true). With this commit, the code above would work OK. If the first call to rconn_send() encounters an error sending the message, then rconn_get_version() in the second iteration will return the same value as in the first iteration. The message passed to rconn_send() will end up being discarded, but that's much better than either an assertion failure >>> or having to carefully analyze a lot of our code to deal with one unusual corner case. Reported-by: Han Zhou Signed-off-by: Ben Pfaff --- lib/learning-switch.c | 2 +- lib/rconn.c | 41 - lib/vconn.c | 1 + 3 files changed, 18 insertions(+), 26 deletions(-) >>> Acked-by: Han Zhou >> >> Thanks. I applied this to master. I'll backport it to older versions >> if no one notices trouble soon. > > I backported as far as branch-2.5. Sorry for being late, but I just tried to use branch-2.9 on my test environment and found that this patch produces a lot of "connected" logs that are a bit annoying: 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected 2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected 2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886:
Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.
On 18.06.2018 22:25, Ben Pfaff wrote: > On Mon, Jun 18, 2018 at 09:31:43AM -0700, Han Zhou wrote: >> On Mon, Jun 18, 2018 at 8:34 AM, Ilya Maximets >> wrote: >>> >>> On 18.06.2018 18:07, Ben Pfaff wrote: On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote: >> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote: >>> On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote: On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff wrote: > > Until now, rconn_get_version() has only reported the OpenFlow >> version in > use when the rconn is actually connected. This makes sense, but >> it has a > harsh consequence. Consider code like this: > > if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= >> 0) { > for (int i = 0; i < 2; i++) { > struct ofpbuf *b = ofputil_encode_echo_request( > rconn_get_version(rconn)); > rconn_send(rconn, b, NULL); > } > } > > Maybe not the smartest code in the world, and probably no one >> would write > this exact code in any case, but it doesn't look too risky or >> crazy. > > But it is. The second trip through the loop can assert-fail inside > ofputil_encode_echo_request() because rconn_get_version(rconn) >> returns -1 > instead of a valid OpenFlow version. That happens if the first >> call to > rconn_send() encounters an error while sending the message and >> therefore > destroys the underlying vconn and disconnects so that >> rconn_get_version() > doesn't have a vconn to query for its version. > > In a case like this where all the code to send the messages is >> close by, we > could just check rconn_get_version() in each loop iteration. We >> could even > go through the tree and convince ourselves that individual bits of >> code are > safe, or be conservative and check rconn_get_version() >= 0 in the >> iffy > cases. But this seems to me like an ongoing source of risk and a >> way to > get things wrong in corner cases. > > This commit takes a different approach. It introduces a new >> invariant: if > an rconn has ever been connected, then it returns a valid OpenFlow >> version > from rconn_get_version(). In addition, if an rconn is currently connected, > then the OpenFlow version it returns is the correct one (that may >> be > obvious, but there were corner cases before where it returned -1 >> even > though rconn_is_connected() returned true). > > With this commit, the code above would work OK. If the first call >> to > rconn_send() encounters an error sending the message, then > rconn_get_version() in the second iteration will return the same >> value as > in the first iteration. The message passed to rconn_send() will >> end up > being discarded, but that's much better than either an assertion >> failure or > having to carefully analyze a lot of our code to deal with one >> unusual > corner case. > > Reported-by: Han Zhou > Signed-off-by: Ben Pfaff > --- > lib/learning-switch.c | 2 +- > lib/rconn.c | 41 >> - > lib/vconn.c | 1 + > 3 files changed, 18 insertions(+), 26 deletions(-) > Acked-by: Han Zhou >>> >>> Thanks. I applied this to master. I'll backport it to older >> versions >>> if no one notices trouble soon. >> >> I backported as far as branch-2.5. > > Sorry for being late, but I just tried to use branch-2.9 on my test >> environment > and found that this patch produces a lot of "connected" logs that are >> a bit annoying: > > 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected > 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected > 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected > 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected > 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected > 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected > 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected > 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected > 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected > 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected > 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected > 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected >
Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.
On Mon, Jun 18, 2018 at 09:31:43AM -0700, Han Zhou wrote: > On Mon, Jun 18, 2018 at 8:34 AM, Ilya Maximets > wrote: > > > > On 18.06.2018 18:07, Ben Pfaff wrote: > > > On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote: > > >>> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote: > > On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote: > > > On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff wrote: > > >> > > >> Until now, rconn_get_version() has only reported the OpenFlow > version in > > >> use when the rconn is actually connected. This makes sense, but > it has a > > >> harsh consequence. Consider code like this: > > >> > > >> if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= > 0) { > > >> for (int i = 0; i < 2; i++) { > > >> struct ofpbuf *b = ofputil_encode_echo_request( > > >> rconn_get_version(rconn)); > > >> rconn_send(rconn, b, NULL); > > >> } > > >> } > > >> > > >> Maybe not the smartest code in the world, and probably no one > would write > > >> this exact code in any case, but it doesn't look too risky or > crazy. > > >> > > >> But it is. The second trip through the loop can assert-fail inside > > >> ofputil_encode_echo_request() because rconn_get_version(rconn) > returns -1 > > >> instead of a valid OpenFlow version. That happens if the first > call to > > >> rconn_send() encounters an error while sending the message and > therefore > > >> destroys the underlying vconn and disconnects so that > rconn_get_version() > > >> doesn't have a vconn to query for its version. > > >> > > >> In a case like this where all the code to send the messages is > close by, > > > we > > >> could just check rconn_get_version() in each loop iteration. We > could > > > even > > >> go through the tree and convince ourselves that individual bits of > code > > > are > > >> safe, or be conservative and check rconn_get_version() >= 0 in the > iffy > > >> cases. But this seems to me like an ongoing source of risk and a > way to > > >> get things wrong in corner cases. > > >> > > >> This commit takes a different approach. It introduces a new > invariant: if > > >> an rconn has ever been connected, then it returns a valid OpenFlow > version > > >> from rconn_get_version(). In addition, if an rconn is currently > > > connected, > > >> then the OpenFlow version it returns is the correct one (that may > be > > >> obvious, but there were corner cases before where it returned -1 > even > > >> though rconn_is_connected() returned true). > > >> > > >> With this commit, the code above would work OK. If the first call > to > > >> rconn_send() encounters an error sending the message, then > > >> rconn_get_version() in the second iteration will return the same > value as > > >> in the first iteration. The message passed to rconn_send() will > end up > > >> being discarded, but that's much better than either an assertion > failure > > > or > > >> having to carefully analyze a lot of our code to deal with one > unusual > > >> corner case. > > >> > > >> Reported-by: Han Zhou > > >> Signed-off-by: Ben Pfaff > > >> --- > > >> lib/learning-switch.c | 2 +- > > >> lib/rconn.c | 41 > - > > >> lib/vconn.c | 1 + > > >> 3 files changed, 18 insertions(+), 26 deletions(-) > > >> > > > Acked-by: Han Zhou > > > > Thanks. I applied this to master. I'll backport it to older > versions > > if no one notices trouble soon. > > >>> > > >>> I backported as far as branch-2.5. > > >> > > >> Sorry for being late, but I just tried to use branch-2.9 on my test > environment > > >> and found that this patch produces a lot of "connected" logs that are > a bit annoying: > > >> > > >> 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected > > >> 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected > > >> 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected > > >> 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected > > >> 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected > > >> 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected > > >> 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected > > >> 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected > > >> 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected > > >> 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected > > >> 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected > > >> 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected > > >>
Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.
On Mon, Jun 18, 2018 at 06:34:13PM +0300, Ilya Maximets wrote: > On 18.06.2018 18:07, Ben Pfaff wrote: > > On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote: > >>> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote: > On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote: > > On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff wrote: > >> > >> Until now, rconn_get_version() has only reported the OpenFlow version > >> in > >> use when the rconn is actually connected. This makes sense, but it > >> has a > >> harsh consequence. Consider code like this: > >> > >> if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) { > >> for (int i = 0; i < 2; i++) { > >> struct ofpbuf *b = ofputil_encode_echo_request( > >> rconn_get_version(rconn)); > >> rconn_send(rconn, b, NULL); > >> } > >> } > >> > >> Maybe not the smartest code in the world, and probably no one would > >> write > >> this exact code in any case, but it doesn't look too risky or crazy. > >> > >> But it is. The second trip through the loop can assert-fail inside > >> ofputil_encode_echo_request() because rconn_get_version(rconn) returns > >> -1 > >> instead of a valid OpenFlow version. That happens if the first call to > >> rconn_send() encounters an error while sending the message and > >> therefore > >> destroys the underlying vconn and disconnects so that > >> rconn_get_version() > >> doesn't have a vconn to query for its version. > >> > >> In a case like this where all the code to send the messages is close > >> by, > > we > >> could just check rconn_get_version() in each loop iteration. We could > > even > >> go through the tree and convince ourselves that individual bits of code > > are > >> safe, or be conservative and check rconn_get_version() >= 0 in the iffy > >> cases. But this seems to me like an ongoing source of risk and a way > >> to > >> get things wrong in corner cases. > >> > >> This commit takes a different approach. It introduces a new > >> invariant: if > >> an rconn has ever been connected, then it returns a valid OpenFlow > >> version > >> from rconn_get_version(). In addition, if an rconn is currently > > connected, > >> then the OpenFlow version it returns is the correct one (that may be > >> obvious, but there were corner cases before where it returned -1 even > >> though rconn_is_connected() returned true). > >> > >> With this commit, the code above would work OK. If the first call to > >> rconn_send() encounters an error sending the message, then > >> rconn_get_version() in the second iteration will return the same value > >> as > >> in the first iteration. The message passed to rconn_send() will end up > >> being discarded, but that's much better than either an assertion > >> failure > > or > >> having to carefully analyze a lot of our code to deal with one unusual > >> corner case. > >> > >> Reported-by: Han Zhou > >> Signed-off-by: Ben Pfaff > >> --- > >> lib/learning-switch.c | 2 +- > >> lib/rconn.c | 41 - > >> lib/vconn.c | 1 + > >> 3 files changed, 18 insertions(+), 26 deletions(-) > >> > > Acked-by: Han Zhou > > Thanks. I applied this to master. I'll backport it to older versions > if no one notices trouble soon. > >>> > >>> I backported as far as branch-2.5. > >> > >> Sorry for being late, but I just tried to use branch-2.9 on my test > >> environment > >> and found that this patch produces a lot of "connected" logs that are a > >> bit annoying: > >> > >> 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected > >> 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected > >> 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected > >> 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected > >> 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected > >> 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected > >> 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected > >> 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected > >> 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected > >> 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected > >> 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected > >> 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected > >> 2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected > >> 2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886: connected > >>
Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.
On Mon, Jun 18, 2018 at 8:34 AM, Ilya Maximets wrote: > > On 18.06.2018 18:07, Ben Pfaff wrote: > > On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote: > >>> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote: > On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote: > > On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff wrote: > >> > >> Until now, rconn_get_version() has only reported the OpenFlow version in > >> use when the rconn is actually connected. This makes sense, but it has a > >> harsh consequence. Consider code like this: > >> > >> if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) { > >> for (int i = 0; i < 2; i++) { > >> struct ofpbuf *b = ofputil_encode_echo_request( > >> rconn_get_version(rconn)); > >> rconn_send(rconn, b, NULL); > >> } > >> } > >> > >> Maybe not the smartest code in the world, and probably no one would write > >> this exact code in any case, but it doesn't look too risky or crazy. > >> > >> But it is. The second trip through the loop can assert-fail inside > >> ofputil_encode_echo_request() because rconn_get_version(rconn) returns -1 > >> instead of a valid OpenFlow version. That happens if the first call to > >> rconn_send() encounters an error while sending the message and therefore > >> destroys the underlying vconn and disconnects so that rconn_get_version() > >> doesn't have a vconn to query for its version. > >> > >> In a case like this where all the code to send the messages is close by, > > we > >> could just check rconn_get_version() in each loop iteration. We could > > even > >> go through the tree and convince ourselves that individual bits of code > > are > >> safe, or be conservative and check rconn_get_version() >= 0 in the iffy > >> cases. But this seems to me like an ongoing source of risk and a way to > >> get things wrong in corner cases. > >> > >> This commit takes a different approach. It introduces a new invariant: if > >> an rconn has ever been connected, then it returns a valid OpenFlow version > >> from rconn_get_version(). In addition, if an rconn is currently > > connected, > >> then the OpenFlow version it returns is the correct one (that may be > >> obvious, but there were corner cases before where it returned -1 even > >> though rconn_is_connected() returned true). > >> > >> With this commit, the code above would work OK. If the first call to > >> rconn_send() encounters an error sending the message, then > >> rconn_get_version() in the second iteration will return the same value as > >> in the first iteration. The message passed to rconn_send() will end up > >> being discarded, but that's much better than either an assertion failure > > or > >> having to carefully analyze a lot of our code to deal with one unusual > >> corner case. > >> > >> Reported-by: Han Zhou > >> Signed-off-by: Ben Pfaff > >> --- > >> lib/learning-switch.c | 2 +- > >> lib/rconn.c | 41 - > >> lib/vconn.c | 1 + > >> 3 files changed, 18 insertions(+), 26 deletions(-) > >> > > Acked-by: Han Zhou > > Thanks. I applied this to master. I'll backport it to older versions > if no one notices trouble soon. > >>> > >>> I backported as far as branch-2.5. > >> > >> Sorry for being late, but I just tried to use branch-2.9 on my test environment > >> and found that this patch produces a lot of "connected" logs that are a bit annoying: > >> > >> 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected > >> 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected > >> 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected > >> 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected > >> 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected > >> 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected > >> 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected > >> 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected > >> 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected > >> 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected > >> 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected > >> 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected > >> 2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected > >> 2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886: connected > >> 2018-06-18T12:55:20.611Z|03065|rconn|INFO|br-int<->unix#2887: connected > >> > >> On my system it's about 2 log messages on each 2 seconds. It has only > >>
Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.
On 18.06.2018 18:07, Ben Pfaff wrote: > On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote: >>> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote: On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote: > On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff wrote: >> >> Until now, rconn_get_version() has only reported the OpenFlow version in >> use when the rconn is actually connected. This makes sense, but it has a >> harsh consequence. Consider code like this: >> >> if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) { >> for (int i = 0; i < 2; i++) { >> struct ofpbuf *b = ofputil_encode_echo_request( >> rconn_get_version(rconn)); >> rconn_send(rconn, b, NULL); >> } >> } >> >> Maybe not the smartest code in the world, and probably no one would write >> this exact code in any case, but it doesn't look too risky or crazy. >> >> But it is. The second trip through the loop can assert-fail inside >> ofputil_encode_echo_request() because rconn_get_version(rconn) returns -1 >> instead of a valid OpenFlow version. That happens if the first call to >> rconn_send() encounters an error while sending the message and therefore >> destroys the underlying vconn and disconnects so that rconn_get_version() >> doesn't have a vconn to query for its version. >> >> In a case like this where all the code to send the messages is close by, > we >> could just check rconn_get_version() in each loop iteration. We could > even >> go through the tree and convince ourselves that individual bits of code > are >> safe, or be conservative and check rconn_get_version() >= 0 in the iffy >> cases. But this seems to me like an ongoing source of risk and a way to >> get things wrong in corner cases. >> >> This commit takes a different approach. It introduces a new invariant: >> if >> an rconn has ever been connected, then it returns a valid OpenFlow >> version >> from rconn_get_version(). In addition, if an rconn is currently > connected, >> then the OpenFlow version it returns is the correct one (that may be >> obvious, but there were corner cases before where it returned -1 even >> though rconn_is_connected() returned true). >> >> With this commit, the code above would work OK. If the first call to >> rconn_send() encounters an error sending the message, then >> rconn_get_version() in the second iteration will return the same value as >> in the first iteration. The message passed to rconn_send() will end up >> being discarded, but that's much better than either an assertion failure > or >> having to carefully analyze a lot of our code to deal with one unusual >> corner case. >> >> Reported-by: Han Zhou >> Signed-off-by: Ben Pfaff >> --- >> lib/learning-switch.c | 2 +- >> lib/rconn.c | 41 - >> lib/vconn.c | 1 + >> 3 files changed, 18 insertions(+), 26 deletions(-) >> > Acked-by: Han Zhou Thanks. I applied this to master. I'll backport it to older versions if no one notices trouble soon. >>> >>> I backported as far as branch-2.5. >> >> Sorry for being late, but I just tried to use branch-2.9 on my test >> environment >> and found that this patch produces a lot of "connected" logs that are a bit >> annoying: >> >> 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected >> 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected >> 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected >> 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected >> 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected >> 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected >> 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected >> 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected >> 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected >> 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected >> 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected >> 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected >> 2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected >> 2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886: connected >> 2018-06-18T12:55:20.611Z|03065|rconn|INFO|br-int<->unix#2887: connected >> >> On my system it's about 2 log messages on each 2 seconds. It has only >> ovsdb-server, ovs-vswitchd and ovsdb-client monitor running. >> >> Could you please check? > > Are you sure that it's this patch that makes a difference? It doesn't > change anything in the logging. I don't know much
Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.
On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote: > > On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote: > >> On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote: > >> > On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff wrote: > >> > > > >> > > Until now, rconn_get_version() has only reported the OpenFlow version > >> > > in > >> > > use when the rconn is actually connected. This makes sense, but it > >> > > has a > >> > > harsh consequence. Consider code like this: > >> > > > >> > > if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) { > >> > > for (int i = 0; i < 2; i++) { > >> > > struct ofpbuf *b = ofputil_encode_echo_request( > >> > > rconn_get_version(rconn)); > >> > > rconn_send(rconn, b, NULL); > >> > > } > >> > > } > >> > > > >> > > Maybe not the smartest code in the world, and probably no one would > >> > > write > >> > > this exact code in any case, but it doesn't look too risky or crazy. > >> > > > >> > > But it is. The second trip through the loop can assert-fail inside > >> > > ofputil_encode_echo_request() because rconn_get_version(rconn) returns > >> > > -1 > >> > > instead of a valid OpenFlow version. That happens if the first call to > >> > > rconn_send() encounters an error while sending the message and > >> > > therefore > >> > > destroys the underlying vconn and disconnects so that > >> > > rconn_get_version() > >> > > doesn't have a vconn to query for its version. > >> > > > >> > > In a case like this where all the code to send the messages is close > >> > > by, > >> > we > >> > > could just check rconn_get_version() in each loop iteration. We could > >> > even > >> > > go through the tree and convince ourselves that individual bits of code > >> > are > >> > > safe, or be conservative and check rconn_get_version() >= 0 in the iffy > >> > > cases. But this seems to me like an ongoing source of risk and a way > >> > > to > >> > > get things wrong in corner cases. > >> > > > >> > > This commit takes a different approach. It introduces a new > >> > > invariant: if > >> > > an rconn has ever been connected, then it returns a valid OpenFlow > >> > > version > >> > > from rconn_get_version(). In addition, if an rconn is currently > >> > connected, > >> > > then the OpenFlow version it returns is the correct one (that may be > >> > > obvious, but there were corner cases before where it returned -1 even > >> > > though rconn_is_connected() returned true). > >> > > > >> > > With this commit, the code above would work OK. If the first call to > >> > > rconn_send() encounters an error sending the message, then > >> > > rconn_get_version() in the second iteration will return the same value > >> > > as > >> > > in the first iteration. The message passed to rconn_send() will end up > >> > > being discarded, but that's much better than either an assertion > >> > > failure > >> > or > >> > > having to carefully analyze a lot of our code to deal with one unusual > >> > > corner case. > >> > > > >> > > Reported-by: Han Zhou > >> > > Signed-off-by: Ben Pfaff > >> > > --- > >> > > lib/learning-switch.c | 2 +- > >> > > lib/rconn.c | 41 - > >> > > lib/vconn.c | 1 + > >> > > 3 files changed, 18 insertions(+), 26 deletions(-) > >> > > > >> > Acked-by: Han Zhou > >> > >> Thanks. I applied this to master. I'll backport it to older versions > >> if no one notices trouble soon. > > > > I backported as far as branch-2.5. > > Sorry for being late, but I just tried to use branch-2.9 on my test > environment > and found that this patch produces a lot of "connected" logs that are a bit > annoying: > > 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected > 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected > 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected > 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected > 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected > 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected > 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected > 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected > 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected > 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected > 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected > 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected > 2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected > 2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886: connected > 2018-06-18T12:55:20.611Z|03065|rconn|INFO|br-int<->unix#2887: connected > > On my system it's about 2 log messages on each 2 seconds. It has only > ovsdb-server, ovs-vswitchd and ovsdb-client monitor running. >
Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.
> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote: >> On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote: >> > On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff wrote: >> > > >> > > Until now, rconn_get_version() has only reported the OpenFlow version in >> > > use when the rconn is actually connected. This makes sense, but it has a >> > > harsh consequence. Consider code like this: >> > > >> > > if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) { >> > > for (int i = 0; i < 2; i++) { >> > > struct ofpbuf *b = ofputil_encode_echo_request( >> > > rconn_get_version(rconn)); >> > > rconn_send(rconn, b, NULL); >> > > } >> > > } >> > > >> > > Maybe not the smartest code in the world, and probably no one would write >> > > this exact code in any case, but it doesn't look too risky or crazy. >> > > >> > > But it is. The second trip through the loop can assert-fail inside >> > > ofputil_encode_echo_request() because rconn_get_version(rconn) returns -1 >> > > instead of a valid OpenFlow version. That happens if the first call to >> > > rconn_send() encounters an error while sending the message and therefore >> > > destroys the underlying vconn and disconnects so that rconn_get_version() >> > > doesn't have a vconn to query for its version. >> > > >> > > In a case like this where all the code to send the messages is close by, >> > we >> > > could just check rconn_get_version() in each loop iteration. We could >> > even >> > > go through the tree and convince ourselves that individual bits of code >> > are >> > > safe, or be conservative and check rconn_get_version() >= 0 in the iffy >> > > cases. But this seems to me like an ongoing source of risk and a way to >> > > get things wrong in corner cases. >> > > >> > > This commit takes a different approach. It introduces a new invariant: >> > > if >> > > an rconn has ever been connected, then it returns a valid OpenFlow >> > > version >> > > from rconn_get_version(). In addition, if an rconn is currently >> > connected, >> > > then the OpenFlow version it returns is the correct one (that may be >> > > obvious, but there were corner cases before where it returned -1 even >> > > though rconn_is_connected() returned true). >> > > >> > > With this commit, the code above would work OK. If the first call to >> > > rconn_send() encounters an error sending the message, then >> > > rconn_get_version() in the second iteration will return the same value as >> > > in the first iteration. The message passed to rconn_send() will end up >> > > being discarded, but that's much better than either an assertion failure >> > or >> > > having to carefully analyze a lot of our code to deal with one unusual >> > > corner case. >> > > >> > > Reported-by: Han Zhou >> > > Signed-off-by: Ben Pfaff >> > > --- >> > > lib/learning-switch.c | 2 +- >> > > lib/rconn.c | 41 - >> > > lib/vconn.c | 1 + >> > > 3 files changed, 18 insertions(+), 26 deletions(-) >> > > >> > Acked-by: Han Zhou >> >> Thanks. I applied this to master. I'll backport it to older versions >> if no one notices trouble soon. > > I backported as far as branch-2.5. Sorry for being late, but I just tried to use branch-2.9 on my test environment and found that this patch produces a lot of "connected" logs that are a bit annoying: 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected 2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected 2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886: connected 2018-06-18T12:55:20.611Z|03065|rconn|INFO|br-int<->unix#2887: connected On my system it's about 2 log messages on each 2 seconds. It has only ovsdb-server, ovs-vswitchd and ovsdb-client monitor running. Could you please check? Best regards, Ilya Maximets. ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.
On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote: > On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote: > > On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff wrote: > > > > > > Until now, rconn_get_version() has only reported the OpenFlow version in > > > use when the rconn is actually connected. This makes sense, but it has a > > > harsh consequence. Consider code like this: > > > > > > if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) { > > > for (int i = 0; i < 2; i++) { > > > struct ofpbuf *b = ofputil_encode_echo_request( > > > rconn_get_version(rconn)); > > > rconn_send(rconn, b, NULL); > > > } > > > } > > > > > > Maybe not the smartest code in the world, and probably no one would write > > > this exact code in any case, but it doesn't look too risky or crazy. > > > > > > But it is. The second trip through the loop can assert-fail inside > > > ofputil_encode_echo_request() because rconn_get_version(rconn) returns -1 > > > instead of a valid OpenFlow version. That happens if the first call to > > > rconn_send() encounters an error while sending the message and therefore > > > destroys the underlying vconn and disconnects so that rconn_get_version() > > > doesn't have a vconn to query for its version. > > > > > > In a case like this where all the code to send the messages is close by, > > we > > > could just check rconn_get_version() in each loop iteration. We could > > even > > > go through the tree and convince ourselves that individual bits of code > > are > > > safe, or be conservative and check rconn_get_version() >= 0 in the iffy > > > cases. But this seems to me like an ongoing source of risk and a way to > > > get things wrong in corner cases. > > > > > > This commit takes a different approach. It introduces a new invariant: if > > > an rconn has ever been connected, then it returns a valid OpenFlow version > > > from rconn_get_version(). In addition, if an rconn is currently > > connected, > > > then the OpenFlow version it returns is the correct one (that may be > > > obvious, but there were corner cases before where it returned -1 even > > > though rconn_is_connected() returned true). > > > > > > With this commit, the code above would work OK. If the first call to > > > rconn_send() encounters an error sending the message, then > > > rconn_get_version() in the second iteration will return the same value as > > > in the first iteration. The message passed to rconn_send() will end up > > > being discarded, but that's much better than either an assertion failure > > or > > > having to carefully analyze a lot of our code to deal with one unusual > > > corner case. > > > > > > Reported-by: Han Zhou > > > Signed-off-by: Ben Pfaff > > > --- > > > lib/learning-switch.c | 2 +- > > > lib/rconn.c | 41 - > > > lib/vconn.c | 1 + > > > 3 files changed, 18 insertions(+), 26 deletions(-) > > > > > > diff --git a/lib/learning-switch.c b/lib/learning-switch.c > > > index 04eaa6e8e73f..8102475cae52 100644 > > > --- a/lib/learning-switch.c > > > +++ b/lib/learning-switch.c > > > @@ -305,7 +305,7 @@ lswitch_run(struct lswitch *sw) > > > rconn_run(sw->rconn); > > > > > > if (sw->state == S_CONNECTING) { > > > -if (rconn_get_version(sw->rconn) != -1) { > > > +if (rconn_is_connected(sw->rconn)) { > > > lswitch_handshake(sw); > > > sw->state = S_FEATURES_REPLY; > > > } > > > diff --git a/lib/rconn.c b/lib/rconn.c > > > index 3cad259d0178..a3f811fedbe3 100644 > > > --- a/lib/rconn.c > > > +++ b/lib/rconn.c > > > @@ -141,7 +141,8 @@ struct rconn { > > > struct vconn *monitors[MAXIMUM_MONITORS]; > > > size_t n_monitors; > > > > > > -uint32_t allowed_versions; > > > +uint32_t allowed_versions; /* Acceptable OpenFlow versions. */ > > > +int version;/* Current or most recent version. */ > > > }; > > > > > > /* Counts packets and bytes queued into an rconn by a given source. */ > > > @@ -182,8 +183,6 @@ static bool is_connected_state(enum state); > > > static bool is_admitted_msg(const struct ofpbuf *); > > > static bool rconn_logging_connection_attempts__(const struct rconn *rc) > > > OVS_REQUIRES(rc->mutex); > > > -static int rconn_get_version__(const struct rconn *rconn) > > > -OVS_REQUIRES(rconn->mutex); > > > > > > /* The following prototypes duplicate those in rconn.h, but there we > > weren't > > > * able to add the OVS_EXCLUDED annotations because the definition of > > struct > > > @@ -284,7 +283,9 @@ rconn_create(int probe_interval, int max_backoff, > > uint8_t dscp, > > > rconn_set_dscp(rc, dscp); > > > > > > rc->n_monitors = 0; > > > + > > > rc->allowed_versions = allowed_versions; > > > +rc->version = -1; > > > > > > return rc; > > > } > > > @@ -376,8 +377,7 @@ rconn_connect_unreliably(struct rconn *rc, > > >
Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.
On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote: > On Wed, May 23, 2018 at 5:14 PM, Ben Pfaffwrote: > > > > Until now, rconn_get_version() has only reported the OpenFlow version in > > use when the rconn is actually connected. This makes sense, but it has a > > harsh consequence. Consider code like this: > > > > if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) { > > for (int i = 0; i < 2; i++) { > > struct ofpbuf *b = ofputil_encode_echo_request( > > rconn_get_version(rconn)); > > rconn_send(rconn, b, NULL); > > } > > } > > > > Maybe not the smartest code in the world, and probably no one would write > > this exact code in any case, but it doesn't look too risky or crazy. > > > > But it is. The second trip through the loop can assert-fail inside > > ofputil_encode_echo_request() because rconn_get_version(rconn) returns -1 > > instead of a valid OpenFlow version. That happens if the first call to > > rconn_send() encounters an error while sending the message and therefore > > destroys the underlying vconn and disconnects so that rconn_get_version() > > doesn't have a vconn to query for its version. > > > > In a case like this where all the code to send the messages is close by, > we > > could just check rconn_get_version() in each loop iteration. We could > even > > go through the tree and convince ourselves that individual bits of code > are > > safe, or be conservative and check rconn_get_version() >= 0 in the iffy > > cases. But this seems to me like an ongoing source of risk and a way to > > get things wrong in corner cases. > > > > This commit takes a different approach. It introduces a new invariant: if > > an rconn has ever been connected, then it returns a valid OpenFlow version > > from rconn_get_version(). In addition, if an rconn is currently > connected, > > then the OpenFlow version it returns is the correct one (that may be > > obvious, but there were corner cases before where it returned -1 even > > though rconn_is_connected() returned true). > > > > With this commit, the code above would work OK. If the first call to > > rconn_send() encounters an error sending the message, then > > rconn_get_version() in the second iteration will return the same value as > > in the first iteration. The message passed to rconn_send() will end up > > being discarded, but that's much better than either an assertion failure > or > > having to carefully analyze a lot of our code to deal with one unusual > > corner case. > > > > Reported-by: Han Zhou > > Signed-off-by: Ben Pfaff > > --- > > lib/learning-switch.c | 2 +- > > lib/rconn.c | 41 - > > lib/vconn.c | 1 + > > 3 files changed, 18 insertions(+), 26 deletions(-) > > > > diff --git a/lib/learning-switch.c b/lib/learning-switch.c > > index 04eaa6e8e73f..8102475cae52 100644 > > --- a/lib/learning-switch.c > > +++ b/lib/learning-switch.c > > @@ -305,7 +305,7 @@ lswitch_run(struct lswitch *sw) > > rconn_run(sw->rconn); > > > > if (sw->state == S_CONNECTING) { > > -if (rconn_get_version(sw->rconn) != -1) { > > +if (rconn_is_connected(sw->rconn)) { > > lswitch_handshake(sw); > > sw->state = S_FEATURES_REPLY; > > } > > diff --git a/lib/rconn.c b/lib/rconn.c > > index 3cad259d0178..a3f811fedbe3 100644 > > --- a/lib/rconn.c > > +++ b/lib/rconn.c > > @@ -141,7 +141,8 @@ struct rconn { > > struct vconn *monitors[MAXIMUM_MONITORS]; > > size_t n_monitors; > > > > -uint32_t allowed_versions; > > +uint32_t allowed_versions; /* Acceptable OpenFlow versions. */ > > +int version;/* Current or most recent version. */ > > }; > > > > /* Counts packets and bytes queued into an rconn by a given source. */ > > @@ -182,8 +183,6 @@ static bool is_connected_state(enum state); > > static bool is_admitted_msg(const struct ofpbuf *); > > static bool rconn_logging_connection_attempts__(const struct rconn *rc) > > OVS_REQUIRES(rc->mutex); > > -static int rconn_get_version__(const struct rconn *rconn) > > -OVS_REQUIRES(rconn->mutex); > > > > /* The following prototypes duplicate those in rconn.h, but there we > weren't > > * able to add the OVS_EXCLUDED annotations because the definition of > struct > > @@ -284,7 +283,9 @@ rconn_create(int probe_interval, int max_backoff, > uint8_t dscp, > > rconn_set_dscp(rc, dscp); > > > > rc->n_monitors = 0; > > + > > rc->allowed_versions = allowed_versions; > > +rc->version = -1; > > > > return rc; > > } > > @@ -376,8 +377,7 @@ rconn_connect_unreliably(struct rconn *rc, > > rconn_set_target__(rc, vconn_get_name(vconn), name); > > rc->reliable = false; > > rc->vconn = vconn; > > -rc->last_connected = time_now(); > > -state_transition(rc, S_ACTIVE); > > +state_transition(rc,
Re: [ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.
On Wed, May 23, 2018 at 5:14 PM, Ben Pfaffwrote: > > Until now, rconn_get_version() has only reported the OpenFlow version in > use when the rconn is actually connected. This makes sense, but it has a > harsh consequence. Consider code like this: > > if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) { > for (int i = 0; i < 2; i++) { > struct ofpbuf *b = ofputil_encode_echo_request( > rconn_get_version(rconn)); > rconn_send(rconn, b, NULL); > } > } > > Maybe not the smartest code in the world, and probably no one would write > this exact code in any case, but it doesn't look too risky or crazy. > > But it is. The second trip through the loop can assert-fail inside > ofputil_encode_echo_request() because rconn_get_version(rconn) returns -1 > instead of a valid OpenFlow version. That happens if the first call to > rconn_send() encounters an error while sending the message and therefore > destroys the underlying vconn and disconnects so that rconn_get_version() > doesn't have a vconn to query for its version. > > In a case like this where all the code to send the messages is close by, we > could just check rconn_get_version() in each loop iteration. We could even > go through the tree and convince ourselves that individual bits of code are > safe, or be conservative and check rconn_get_version() >= 0 in the iffy > cases. But this seems to me like an ongoing source of risk and a way to > get things wrong in corner cases. > > This commit takes a different approach. It introduces a new invariant: if > an rconn has ever been connected, then it returns a valid OpenFlow version > from rconn_get_version(). In addition, if an rconn is currently connected, > then the OpenFlow version it returns is the correct one (that may be > obvious, but there were corner cases before where it returned -1 even > though rconn_is_connected() returned true). > > With this commit, the code above would work OK. If the first call to > rconn_send() encounters an error sending the message, then > rconn_get_version() in the second iteration will return the same value as > in the first iteration. The message passed to rconn_send() will end up > being discarded, but that's much better than either an assertion failure or > having to carefully analyze a lot of our code to deal with one unusual > corner case. > > Reported-by: Han Zhou > Signed-off-by: Ben Pfaff > --- > lib/learning-switch.c | 2 +- > lib/rconn.c | 41 - > lib/vconn.c | 1 + > 3 files changed, 18 insertions(+), 26 deletions(-) > > diff --git a/lib/learning-switch.c b/lib/learning-switch.c > index 04eaa6e8e73f..8102475cae52 100644 > --- a/lib/learning-switch.c > +++ b/lib/learning-switch.c > @@ -305,7 +305,7 @@ lswitch_run(struct lswitch *sw) > rconn_run(sw->rconn); > > if (sw->state == S_CONNECTING) { > -if (rconn_get_version(sw->rconn) != -1) { > +if (rconn_is_connected(sw->rconn)) { > lswitch_handshake(sw); > sw->state = S_FEATURES_REPLY; > } > diff --git a/lib/rconn.c b/lib/rconn.c > index 3cad259d0178..a3f811fedbe3 100644 > --- a/lib/rconn.c > +++ b/lib/rconn.c > @@ -141,7 +141,8 @@ struct rconn { > struct vconn *monitors[MAXIMUM_MONITORS]; > size_t n_monitors; > > -uint32_t allowed_versions; > +uint32_t allowed_versions; /* Acceptable OpenFlow versions. */ > +int version;/* Current or most recent version. */ > }; > > /* Counts packets and bytes queued into an rconn by a given source. */ > @@ -182,8 +183,6 @@ static bool is_connected_state(enum state); > static bool is_admitted_msg(const struct ofpbuf *); > static bool rconn_logging_connection_attempts__(const struct rconn *rc) > OVS_REQUIRES(rc->mutex); > -static int rconn_get_version__(const struct rconn *rconn) > -OVS_REQUIRES(rconn->mutex); > > /* The following prototypes duplicate those in rconn.h, but there we weren't > * able to add the OVS_EXCLUDED annotations because the definition of struct > @@ -284,7 +283,9 @@ rconn_create(int probe_interval, int max_backoff, uint8_t dscp, > rconn_set_dscp(rc, dscp); > > rc->n_monitors = 0; > + > rc->allowed_versions = allowed_versions; > +rc->version = -1; > > return rc; > } > @@ -376,8 +377,7 @@ rconn_connect_unreliably(struct rconn *rc, > rconn_set_target__(rc, vconn_get_name(vconn), name); > rc->reliable = false; > rc->vconn = vconn; > -rc->last_connected = time_now(); > -state_transition(rc, S_ACTIVE); > +state_transition(rc, S_CONNECTING); > ovs_mutex_unlock(>mutex); > } > > @@ -514,6 +514,7 @@ run_CONNECTING(struct rconn *rc) > VLOG_INFO("%s: connected", rc->name); > rc->n_successful_connections++; > state_transition(rc, S_ACTIVE); > +rc->version =
[ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.
Until now, rconn_get_version() has only reported the OpenFlow version in use when the rconn is actually connected. This makes sense, but it has a harsh consequence. Consider code like this: if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) { for (int i = 0; i < 2; i++) { struct ofpbuf *b = ofputil_encode_echo_request( rconn_get_version(rconn)); rconn_send(rconn, b, NULL); } } Maybe not the smartest code in the world, and probably no one would write this exact code in any case, but it doesn't look too risky or crazy. But it is. The second trip through the loop can assert-fail inside ofputil_encode_echo_request() because rconn_get_version(rconn) returns -1 instead of a valid OpenFlow version. That happens if the first call to rconn_send() encounters an error while sending the message and therefore destroys the underlying vconn and disconnects so that rconn_get_version() doesn't have a vconn to query for its version. In a case like this where all the code to send the messages is close by, we could just check rconn_get_version() in each loop iteration. We could even go through the tree and convince ourselves that individual bits of code are safe, or be conservative and check rconn_get_version() >= 0 in the iffy cases. But this seems to me like an ongoing source of risk and a way to get things wrong in corner cases. This commit takes a different approach. It introduces a new invariant: if an rconn has ever been connected, then it returns a valid OpenFlow version from rconn_get_version(). In addition, if an rconn is currently connected, then the OpenFlow version it returns is the correct one (that may be obvious, but there were corner cases before where it returned -1 even though rconn_is_connected() returned true). With this commit, the code above would work OK. If the first call to rconn_send() encounters an error sending the message, then rconn_get_version() in the second iteration will return the same value as in the first iteration. The message passed to rconn_send() will end up being discarded, but that's much better than either an assertion failure or having to carefully analyze a lot of our code to deal with one unusual corner case. Reported-by: Han ZhouSigned-off-by: Ben Pfaff --- lib/learning-switch.c | 2 +- lib/rconn.c | 41 - lib/vconn.c | 1 + 3 files changed, 18 insertions(+), 26 deletions(-) diff --git a/lib/learning-switch.c b/lib/learning-switch.c index 04eaa6e8e73f..8102475cae52 100644 --- a/lib/learning-switch.c +++ b/lib/learning-switch.c @@ -305,7 +305,7 @@ lswitch_run(struct lswitch *sw) rconn_run(sw->rconn); if (sw->state == S_CONNECTING) { -if (rconn_get_version(sw->rconn) != -1) { +if (rconn_is_connected(sw->rconn)) { lswitch_handshake(sw); sw->state = S_FEATURES_REPLY; } diff --git a/lib/rconn.c b/lib/rconn.c index 3cad259d0178..a3f811fedbe3 100644 --- a/lib/rconn.c +++ b/lib/rconn.c @@ -141,7 +141,8 @@ struct rconn { struct vconn *monitors[MAXIMUM_MONITORS]; size_t n_monitors; -uint32_t allowed_versions; +uint32_t allowed_versions; /* Acceptable OpenFlow versions. */ +int version;/* Current or most recent version. */ }; /* Counts packets and bytes queued into an rconn by a given source. */ @@ -182,8 +183,6 @@ static bool is_connected_state(enum state); static bool is_admitted_msg(const struct ofpbuf *); static bool rconn_logging_connection_attempts__(const struct rconn *rc) OVS_REQUIRES(rc->mutex); -static int rconn_get_version__(const struct rconn *rconn) -OVS_REQUIRES(rconn->mutex); /* The following prototypes duplicate those in rconn.h, but there we weren't * able to add the OVS_EXCLUDED annotations because the definition of struct @@ -284,7 +283,9 @@ rconn_create(int probe_interval, int max_backoff, uint8_t dscp, rconn_set_dscp(rc, dscp); rc->n_monitors = 0; + rc->allowed_versions = allowed_versions; +rc->version = -1; return rc; } @@ -376,8 +377,7 @@ rconn_connect_unreliably(struct rconn *rc, rconn_set_target__(rc, vconn_get_name(vconn), name); rc->reliable = false; rc->vconn = vconn; -rc->last_connected = time_now(); -state_transition(rc, S_ACTIVE); +state_transition(rc, S_CONNECTING); ovs_mutex_unlock(>mutex); } @@ -514,6 +514,7 @@ run_CONNECTING(struct rconn *rc) VLOG_INFO("%s: connected", rc->name); rc->n_successful_connections++; state_transition(rc, S_ACTIVE); +rc->version = vconn_get_version(rc->vconn); rc->last_connected = rc->state_entered; } else if (retval != EAGAIN) { if (rconn_logging_connection_attempts__(rc)) { @@ -575,14 +576,8 @@ run_ACTIVE(struct rconn *rc) * can end up queuing a packet with vconn == NULL and then