Re: [ovs-discuss] [openvswitch 2.10.0+2018.08.28+git.e0cea85314+ds2] testsuite: 975 2347 2482 2483 2633 failed
On Wed, Sep 05, 2018 at 01:50:06PM +0200, Thomas Goirand wrote:
> On 09/04/2018 11:06 PM, Ben Pfaff wrote:
> > On Tue, Sep 04, 2018 at 09:20:45AM +0200, Thomas Goirand wrote:
> >> On 09/02/2018 03:12 AM, Justin Pettit wrote:
> >>> On Sep 1, 2018, at 3:52 PM, Ben Pfaff wrote:
> >>>> On Sat, Sep 01, 2018 at 01:23:32PM -0700, Justin Pettit wrote:
> >>>>> On Sep 1, 2018, at 12:21 PM, Thomas Goirand wrote:
> >>>>>>
> >>>>>> The only one failure:
> >>>>>>
> >>>>>> 2633: ovn -- ACL rate-limited logging FAILED (ovn.at:6516)
> >>>>>
> >>>>> My guess is that this is meter-related. Can you send the
> >>>>> ovs-vswitchd.log and testsuite.log so I can take a look?
> >>>>
> >>>> It probably hasn't changed from what he sent the first time around.
> >>>
> >>> Yes, "testsuite.log" was in the original message, so I don't need that.
> >>> Thomas, can you send me "ovs-vswitchd.log" and "ovn-controller.log"?
> >>> Does it consistently fail for you?
> >>>
> >>> --Justin
> >>
> >> Hi,
> >>
> >> As I blacklisted the above test, I uploaded to Sid, and now there's a
> >> number of failures on non-Intel arches:
> >>
> >> https://buildd.debian.org/status/package.php?p=openvswitch
> >> https://buildd.debian.org/status/logs.php?pkg=openvswitch
> >>
> >> Ben, Justin, can you help me fix all of this?
> >
> > Thanks for passing that along.
> >
> > A lot of these failures seem to involve unexpected timeouts. I wonder
> > whether the buildds are so overloaded that some of the 10-second
> > timeouts in the testsuite are just too short. Usually, this is a
> > generous timeout interval.
> >
> > I sent a patch that should help to debug the problem by doing more
> > logging: https://patchwork.ozlabs.org/patch/966087/
> >
> > It won't help with tests that fully succeed, because the logs by
> > default are discarded, but for tests that have a sequence of waits, in
> > which one eventually fails, it will allow us to see how long the
> > successful waits took.
> >
> > Any chance you could apply that patch and try another build? Feel free
> > to wait for review, if you prefer.
>
> Hi,
>
> I've just uploaded OVS with that patch. Thanks, I think it's a very good
> idea. And indeed, it looks like the failing arches are the slower ones.

I'm pretty pleased with the theory myself, but the results tend to show
that it wasn't the problem. In most of the tests that eventually failed,
the wait failure was preceded by other waits that succeeded immediately,
and the longest wait I see is 3 seconds. I'll look for other possible
causes.

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
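The kind of wait the logging patch instruments can be pictured with a
small hypothetical sketch (not the actual testsuite macro — the real
OVS_WAIT_UNTIL lives in the m4 test macros, and WAIT_TIMEOUT here is a
made-up knob standing in for the fixed 10-second limit):

```shell
# Poll a command until it succeeds, retrying once per second, and report
# how long the wait took -- the information the patch makes visible.
wait_until() {
    timeout="${WAIT_TIMEOUT:-10}"
    n=0
    until "$@"; do
        n=$((n + 1))
        if [ "$n" -ge "$timeout" ]; then
            echo "wait FAILED after $n retries: $*" >&2
            return 1
        fi
        sleep 1
    done
    echo "wait succeeded after $n retries: $*" >&2
    return 0
}
```

On a loaded buildd, per-wait timings like these show whether successful
waits cluster near the timeout (supporting the overload theory) or finish
immediately (refuting it, as Ben observes above).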
Re: [ovs-discuss] Possible data loss of OVSDB active-backup mode
Thanks Numan: I will give it a shot and update the findings.

On Wed, Sep 5, 2018 at 5:35 AM Numan Siddique wrote:
>
> On Wed, Sep 5, 2018 at 12:42 AM Han Zhou wrote:
>>
>> On Sun, Sep 2, 2018 at 11:01 PM Numan Siddique wrote:
>> >
>> > On Fri, Aug 10, 2018 at 3:59 AM Ben Pfaff wrote:
>> >>
>> >> On Thu, Aug 09, 2018 at 09:32:21AM -0700, Han Zhou wrote:
>> >> > On Thu, Aug 9, 2018 at 1:57 AM, aginwala wrote:
>> >> > >
>> >> > > To add on, we are using the LB VIP IP and no constraint with 3
>> >> > > nodes, as Han mentioned earlier, where the active node syncs from
>> >> > > an invalid IP and the other two nodes sync from the LB VIP IP.
>> >> > > Also, I was able to get some logs from one node that triggered:
>> >> > > https://github.com/openvswitch/ovs/blob/master/ovsdb/ovsdb-server.c#L460
>> >> > >
>> >> > > 2018-08-04T01:43:39.914Z|03230|reconnect|DBG|tcp:10.189.208.16:50686: entering RECONNECT
>> >> > > 2018-08-04T01:43:39.914Z|03231|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.16:50686: disconnecting (removing OVN_Northbound database due to server termination)
>> >> > > 2018-08-04T01:43:39.932Z|03232|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.21:56160: disconnecting (removing _Server database due to server termination)
>> >> > >
>> >> > > I am not sure if sync_from on the active node, too, via some
>> >> > > invalid IP is causing some flaw when all are down during the race
>> >> > > condition in this corner case.
>> >> > >
>> >> > > On Thu, Aug 9, 2018 at 1:35 AM Numan Siddique <nusid...@redhat.com> wrote:
>> >> > >>
>> >> > >> On Thu, Aug 9, 2018 at 1:07 AM Ben Pfaff wrote:
>> >> > >>>
>> >> > >>> On Wed, Aug 08, 2018 at 12:18:10PM -0700, Han Zhou wrote:
>> >> > >>> > On Wed, Aug 8, 2018 at 11:24 AM, Ben Pfaff wrote:
>> >> > >>> > >
>> >> > >>> > > On Wed, Aug 08, 2018 at 12:37:04AM -0700, Han Zhou wrote:
>> >> > >>> > > > Hi,
>> >> > >>> > > >
>> >> > >>> > > > We found an issue in our testing (thanks aginwala) with
>> >> > >>> > > > active-backup mode in an OVN setup. In the 3-node setup
>> >> > >>> > > > with pacemaker, after stopping pacemaker on all three
>> >> > >>> > > > nodes (to simulate a complete shutdown), and then starting
>> >> > >>> > > > all of them simultaneously, there is a good chance that
>> >> > >>> > > > the whole DB content gets lost.
>> >> > >>> > > >
>> >> > >>> > > > After studying the replication code, it seems there is a
>> >> > >>> > > > phase where the backup node deletes all its data and waits
>> >> > >>> > > > for data to be synced from the active node:
>> >> > >>> > > > https://github.com/openvswitch/ovs/blob/master/ovsdb/replication.c#L306
>> >> > >>> > > >
>> >> > >>> > > > In this state, if the node is set to active, then all data
>> >> > >>> > > > is gone for the whole cluster. This can happen in different
>> >> > >>> > > > situations. In the test scenario mentioned above it is very
>> >> > >>> > > > likely to happen, since pacemaker just randomly selects one
>> >> > >>> > > > node as master, not knowing the internal sync state of each
>> >> > >>> > > > node. It could also happen when failover happens right
>> >> > >>> > > > after a new backup is started, although that is less likely
>> >> > >>> > > > in a real environment, so starting up the nodes one by one
>> >> > >>> > > > may largely reduce the probability.
>> >> > >>> > > >
>> >> > >>> > > > Does this analysis make sense? We will do more tests to
>> >> > >>> > > > verify the conclusion, but we would like to share it with
>> >> > >>> > > > the community for discussion and suggestions. Once this
>> >> > >>> > > > happens it is very critical - even more serious than just
>> >> > >>> > > > no HA. Without HA it is just a control plane outage, but
>> >> > >>> > > > this would be a data plane outage, because OVS flows will
>> >> > >>> > > > be removed accordingly, since the data is considered
>> >> > >>> > > > deleted from ovn-controller's point of view.
>> >> > >>> > > >
>> >> > >>> > > > We understand that active-standby is not the ideal HA
>> >> > >>> > > > mechanism and clustering is the future, and we are also
>> >> > >>> > > > testing the clustering with the latest patch. But it would
>> >> > >>> > > > be good if this problem could be addressed with some quick
>> >> > >>> > > > fix, such as keeping a copy of the old data somewhere until
>> >> > >>> > > > the first sync finishes?
>> >> > >>> > >
>> >> > >>> > > This does seem like a plausible bug, and at first glance I
>> >> > >>> > > believe that you're correct about the race here. I guess
>> >> > >>> > > that the correct behavior must be to keep the original data
>> >> > >>> > > until a new copy of
Re: [ovs-discuss] Geneve remote_ip as flow for OVN hosts
Hello all,

I would like to add more context here. In the diagram below:

  +------------------------------------+
  | ovn-host                           |
  |                                    |
  |             +--------+             |
  |             | br-int |             |
  |             ++------++             |
  |              |      |              |
  |      +-------v--+ +-v--------+     |
  |      |  geneve  | |  geneve  |     |
  |      +-------+--+ +-+--------+     |
  |              |      |              |
  |          +---v--+ +-v----+         |
  |          | IP0  | | IP1  |         |
  |          +------+ +------+         |
  +----------+ eth0 +-+ eth1 +---------+
             +------+ +------+

eth0 and eth1 are, say, each in their own physical segment. The VMs
instantiated on the above ovn-host will have multiple interfaces, and each
of those interfaces needs to be on a different Geneve VTEP.

I think the following entry in the OVN TODOs
(https://github.com/openvswitch/ovs/blob/master/ovn/TODO.rst)

---8<---8<---
Support multiple tunnel encapsulations in Chassis.

So far, both ovn-controller and ovn-controller-vtep only allow chassis to
have one tunnel encapsulation entry. We should extend the implementation
to support multiple tunnel encapsulations.
---8<---8<---

captures the above requirement. Is that the case?

Thanks again.

Regards,
~Girish

On Tue, Sep 4, 2018 at 3:00 PM Girish Moodalbail wrote:
> Hello all,
>
> Is it possible to configure remote_ip as a 'flow' instead of an IP
> address (i.e., instead of setting ovn-encap-ip to a single IP address)?
>
> Today, we have one VTEP endpoint per OVN host, and all the VMs that
> connect to br-int on that OVN host are reachable behind this VTEP
> endpoint. Is it possible to have multiple VTEP endpoints for a br-int
> bridge and use OpenFlow flows to select one of the VTEP endpoints?
>
>   +------------------------------------+
>   | ovn-host                           |
>   |                                    |
>   |             +--------+             |
>   |             | br-int |             |
>   |             ++------++             |
>   |              |      |              |
>   |      +-------v--+ +-v--------+     |
>   |      |  geneve  | |  geneve  |     |
>   |      +-------+--+ +-+--------+     |
>   |              |      |              |
>   |          +---v--+ +-v----+         |
>   |          | IP0  | | IP1  |         |
>   |          +------+ +------+         |
>   +----------+ eth0 +-+ eth1 +---------+
>              +------+ +------+
>
> Also, we don't want to bond eth0 and eth1 into a bond interface and then
> use the bond's IP as the VTEP endpoint.
>
> Thanks in advance,
> ~Girish
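For plain Open vSwitch (outside of what ovn-controller configures today,
where ovn-encap-ip is a single address), a tunnel port's destination can
already be deferred to the flow table with `options:remote_ip=flow`. The
sketch below uses made-up port names and addresses and only illustrates
the OVS-level mechanism the TODO entry would need to expose through OVN:

```shell
# Geneve port whose remote endpoint is chosen per flow instead of being
# fixed at port creation time (bridge, port, and IPs are hypothetical).
ovs-vsctl add-port br-int geneve0 -- \
    set interface geneve0 type=geneve \
    options:remote_ip=flow options:key=flow

# A flow then selects the VTEP destination for matching traffic by
# writing the tunnel metadata before outputting to the tunnel port.
ovs-ofctl add-flow br-int \
    "in_port=vm1,actions=set_field:192.0.2.10->tun_dst,set_field:100->tun_id,output:geneve0"
```

This handles the "multiple destinations through one tunnel port" side;
having multiple local encapsulation endpoints per chassis is the part the
TODO entry describes.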
Re: [ovs-discuss] Possible data loss of OVSDB active-backup mode
On Wed, Sep 5, 2018 at 12:42 AM Han Zhou wrote:
>
> On Sun, Sep 2, 2018 at 11:01 PM Numan Siddique wrote:
> >
> > On Fri, Aug 10, 2018 at 3:59 AM Ben Pfaff wrote:
> >>
> >> On Thu, Aug 09, 2018 at 09:32:21AM -0700, Han Zhou wrote:
> >> > On Thu, Aug 9, 2018 at 1:57 AM, aginwala wrote:
> >> > >
> >> > > To add on, we are using the LB VIP IP and no constraint with 3
> >> > > nodes, as Han mentioned earlier, where the active node syncs from
> >> > > an invalid IP and the other two nodes sync from the LB VIP IP.
> >> > > Also, I was able to get some logs from one node that triggered:
> >> > > https://github.com/openvswitch/ovs/blob/master/ovsdb/ovsdb-server.c#L460
> >> > >
> >> > > 2018-08-04T01:43:39.914Z|03230|reconnect|DBG|tcp:10.189.208.16:50686: entering RECONNECT
> >> > > 2018-08-04T01:43:39.914Z|03231|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.16:50686: disconnecting (removing OVN_Northbound database due to server termination)
> >> > > 2018-08-04T01:43:39.932Z|03232|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.21:56160: disconnecting (removing _Server database due to server termination)
> >> > >
> >> > > I am not sure if sync_from on the active node, too, via some
> >> > > invalid IP is causing some flaw when all are down during the race
> >> > > condition in this corner case.
> >> > >
> >> > > On Thu, Aug 9, 2018 at 1:35 AM Numan Siddique wrote:
> >> > >>
> >> > >> On Thu, Aug 9, 2018 at 1:07 AM Ben Pfaff wrote:
> >> > >>>
> >> > >>> On Wed, Aug 08, 2018 at 12:18:10PM -0700, Han Zhou wrote:
> >> > >>> > On Wed, Aug 8, 2018 at 11:24 AM, Ben Pfaff wrote:
> >> > >>> > >
> >> > >>> > > On Wed, Aug 08, 2018 at 12:37:04AM -0700, Han Zhou wrote:
> >> > >>> > > > Hi,
> >> > >>> > > >
> >> > >>> > > > We found an issue in our testing (thanks aginwala) with
> >> > >>> > > > active-backup mode in an OVN setup. In the 3-node setup
> >> > >>> > > > with pacemaker, after stopping pacemaker on all three
> >> > >>> > > > nodes (to simulate a complete shutdown), and then starting
> >> > >>> > > > all of them simultaneously, there is a good chance that
> >> > >>> > > > the whole DB content gets lost.
> >> > >>> > > >
> >> > >>> > > > After studying the replication code, it seems there is a
> >> > >>> > > > phase where the backup node deletes all its data and waits
> >> > >>> > > > for data to be synced from the active node:
> >> > >>> > > > https://github.com/openvswitch/ovs/blob/master/ovsdb/replication.c#L306
> >> > >>> > > >
> >> > >>> > > > In this state, if the node is set to active, then all data
> >> > >>> > > > is gone for the whole cluster. This can happen in different
> >> > >>> > > > situations. In the test scenario mentioned above it is very
> >> > >>> > > > likely to happen, since pacemaker just randomly selects one
> >> > >>> > > > node as master, not knowing the internal sync state of each
> >> > >>> > > > node. It could also happen when failover happens right
> >> > >>> > > > after a new backup is started, although that is less likely
> >> > >>> > > > in a real environment, so starting up the nodes one by one
> >> > >>> > > > may largely reduce the probability.
> >> > >>> > > >
> >> > >>> > > > Does this analysis make sense? We will do more tests to
> >> > >>> > > > verify the conclusion, but we would like to share it with
> >> > >>> > > > the community for discussion and suggestions. Once this
> >> > >>> > > > happens it is very critical - even more serious than just
> >> > >>> > > > no HA. Without HA it is just a control plane outage, but
> >> > >>> > > > this would be a data plane outage, because OVS flows will
> >> > >>> > > > be removed accordingly, since the data is considered
> >> > >>> > > > deleted from ovn-controller's point of view.
> >> > >>> > > >
> >> > >>> > > > We understand that active-standby is not the ideal HA
> >> > >>> > > > mechanism and clustering is the future, and we are also
> >> > >>> > > > testing the clustering with the latest patch. But it would
> >> > >>> > > > be good if this problem could be addressed with some quick
> >> > >>> > > > fix, such as keeping a copy of the old data somewhere until
> >> > >>> > > > the first sync finishes?
> >> > >>> > >
> >> > >>> > > This does seem like a plausible bug, and at first glance I
> >> > >>> > > believe that you're correct about the race here. I guess
> >> > >>> > > that the correct behavior must be to keep the original data
> >> > >>> > > until a new copy of the data has been received, and only
> >> > >>> > > then atomically replace the original by the new.
> >> > >>> > >
> >> > >>> > > Is this something you have time and ability to fix?
> >> > >>> >
> >> > >>> > Thanks Ben for quick response. I guess I will not have time
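The window described in the thread, and the atomic-replace behavior Ben
suggests, can be pictured with a deliberately simplified file-based sketch
(ovsdb-server replicates tables in memory plus a journal, not a flat file
like this; all names below are made up):

```shell
# Buggy order: destroy the local copy first, then sync from the active
# node.  If this node is promoted to active between the two steps, the
# empty database is what gets replicated to everyone else.
sync_delete_then_copy() {
    : > database              # local data is gone here
    cp snapshot database      # ...and only now restored from the active
}

# Suggested fix: stage the new copy beside the old one and swap them in
# one step, so there is never a moment with no data.
sync_copy_then_swap() {
    cp snapshot database.tmp  # old data stays intact while copying
    mv database.tmp database  # rename makes the switch atomic
}
```

The same keep-until-replaced idea applies whatever the storage: the backup
should not discard its current tables until the full snapshot from the
active node has arrived.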
Re: [ovs-discuss] [openvswitch 2.10.0+2018.08.28+git.e0cea85314+ds2] testsuite: 975 2347 2482 2483 2633 failed
On 09/04/2018 11:06 PM, Ben Pfaff wrote:
> On Tue, Sep 04, 2018 at 09:20:45AM +0200, Thomas Goirand wrote:
>> On 09/02/2018 03:12 AM, Justin Pettit wrote:
>>> On Sep 1, 2018, at 3:52 PM, Ben Pfaff wrote:
>>>> On Sat, Sep 01, 2018 at 01:23:32PM -0700, Justin Pettit wrote:
>>>>> On Sep 1, 2018, at 12:21 PM, Thomas Goirand wrote:
>>>>>>
>>>>>> The only one failure:
>>>>>>
>>>>>> 2633: ovn -- ACL rate-limited logging FAILED (ovn.at:6516)
>>>>>
>>>>> My guess is that this is meter-related. Can you send the
>>>>> ovs-vswitchd.log and testsuite.log so I can take a look?
>>>>
>>>> It probably hasn't changed from what he sent the first time around.
>>>
>>> Yes, "testsuite.log" was in the original message, so I don't need that.
>>> Thomas, can you send me "ovs-vswitchd.log" and "ovn-controller.log"?
>>> Does it consistently fail for you?
>>>
>>> --Justin
>>
>> Hi,
>>
>> As I blacklisted the above test, I uploaded to Sid, and now there's a
>> number of failures on non-Intel arches:
>>
>> https://buildd.debian.org/status/package.php?p=openvswitch
>> https://buildd.debian.org/status/logs.php?pkg=openvswitch
>>
>> Ben, Justin, can you help me fix all of this?
>
> Thanks for passing that along.
>
> A lot of these failures seem to involve unexpected timeouts. I wonder
> whether the buildds are so overloaded that some of the 10-second
> timeouts in the testsuite are just too short. Usually, this is a
> generous timeout interval.
>
> I sent a patch that should help to debug the problem by doing more
> logging: https://patchwork.ozlabs.org/patch/966087/
>
> It won't help with tests that fully succeed, because the logs by default
> are discarded, but for tests that have a sequence of waits, in which one
> eventually fails, it will allow us to see how long the successful waits
> took.
>
> Any chance you could apply that patch and try another build? Feel free
> to wait for review, if you prefer.

Hi,

I've just uploaded OVS with that patch. Thanks, I think it's a very good
idea. And indeed, it looks like the failing arches are the slower ones.

Cheers,

Thomas Goirand (zigo)
[ovs-discuss] Regarding kernel module debugging
Hello everyone,

I am new to kernel module programming. Can anyone please tell me how to
debug the datapath kernel modules, e.g. openvswitch.ko? I actually want
to get a real feeling for how things move in OVS. It would be great if
you could list out the tool names and the steps to debug the kernel
module.

Thanks,
vikash