Re: [ovs-dev] [ovs-discuss] ovsdb-server core dump and ovsdb corruption using raft cluster

2018-08-06 Thread Ben Pfaff
With Guru's help, I believe I have fixed it:
https://patchwork.ozlabs.org/patch/954247/

On Wed, Aug 01, 2018 at 11:46:38AM -0700, Guru Shetty wrote:
> I was able to reproduce it. I will work with Ben to get this fixed.
> 
> On 26 July 2018 at 23:14, Girish Moodalbail  wrote:
>
> > [reproduction steps and stack trace trimmed; the full text appears in
> > Girish's 2018-07-27 message later in this thread.]

Re: [ovs-dev] [ovs-discuss] ovsdb-server core dump and ovsdb corruption using raft cluster

2018-08-01 Thread Guru Shetty
I was able to reproduce it. I will work with Ben to get this fixed.

On 26 July 2018 at 23:14, Girish Moodalbail  wrote:

> [reproduction steps and stack trace trimmed; the full text appears in
> Girish's 2018-07-27 message later in this thread.]

Re: [ovs-dev] [ovs-discuss] ovsdb-server core dump and ovsdb corruption using raft cluster

2018-07-31 Thread Girish Moodalbail
Hello Ben/Guru,

Wanted to check if you were able to reproduce the issue on your end, and
whether you guys needed any more info from me.
If you guys have any patch, then we are more than happy to verify it.

regards,
~Girish

On Thu, Jul 26, 2018 at 11:14 PM, Girish Moodalbail  wrote:

> [reproduction steps and stack trace trimmed; the full text appears in
> the 2018-07-27 message later in this thread.]

Re: [ovs-dev] [ovs-discuss] ovsdb-server core dump and ovsdb corruption using raft cluster

2018-07-27 Thread Girish Moodalbail
Hello Ben,

Sorry, got distracted with something else at work. I am still able to
reproduce the issue, and this is what I have and what I did
(if you need the core, let me know and I can share it with you)

- 3-node RAFT cluster across Ubuntu VMs (2 VCPUs with 8GB RAM each)
  $ uname -r
  Linux u1804-HVM-domU 4.15.0-23-generic #25-Ubuntu SMP Wed May 23 18:02:16
UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

- On all of the VMs, I have installed openvswitch-switch=2.9.2,
openvswitch-dbg=2.9.2, and ovn-central=2.9.2
  (all of these packages are from http://packages.wand.net.nz/)

- I bring up the nodes in the cluster one after the other -- the leader first,
followed by the two followers
- I check the cluster status and everything is healthy
- ovn-nbctl show and ovn-sbctl show are both empty

- On the leader, with OVN_NB_DB set to the comma-separated NB connection
strings, I ran
   for i in `seq 1 50`; do ovn-nbctl ls-add ls$i; ovn-nbctl lsp-add ls$i
port0_$i; done

- Check for the presence of 50 logical switches and 50 logical ports (one
on each switch). Compact the database on all the nodes.

- Next I try to delete the ports, and while the deletion is happening I run
compact on one of the followers

  leader_node# for i in `seq  1 50`; do ovn-nbctl lsp-del port0_$i;done
  follower_node# ovs-appctl -t /var/run/openvswitch/ovnnb_db.ctl
ovsdb-server/compact OVN_Northbound

- On the follower node I see the crash:

● ovn-central.service - LSB: OVN central components
   Loaded: loaded (/etc/init.d/ovn-central; generated)
   Active: active (running) since Thu 2018-07-26 22:48:53 PDT; 19min ago
 Docs: man:systemd-sysv-generator(8)
  Process: 21883 ExecStop=/etc/init.d/ovn-central stop (code=exited,
status=0/SUCCESS)
  Process: 21934 ExecStart=/etc/init.d/ovn-central start (code=exited,
status=0/SUCCESS)
Tasks: 10 (limit: 4915)
   CGroup: /system.slice/ovn-central.service
   ├─22047 ovsdb-server: monitoring pid 22134 (1 crashes: pid
22048 died, killed (Aborted), core dumped)
   ├─22059 ovsdb-server: monitoring pid 22060 (healthy)
   ├─22060 ovsdb-server -vconsole:off -vfile:info
--log-file=/var/log/openvswitch/ovsdb-server-sb.log -
   ├─22072 ovn-northd: monitoring pid 22073 (healthy)
   ├─22073 ovn-northd -vconsole:emer -vsyslog:err -vfile:info
--ovnnb-db=tcp:10.0.7.33:6641,tcp:10.0.7.
   └─22134 ovsdb-server -vconsole:off -vfile:info
--log-file=/var/log/openvswitch/ovsdb-server-nb.log


Same call trace and reason:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x7f79599a1801 in __GI_abort () at abort.c:79
#2  0x5596879c017c in json_serialize (json=<optimized out>,
s=<optimized out>) at ../lib/json.c:1554
#3  0x5596879c01eb in json_serialize_object_member (i=<optimized out>,
s=<optimized out>, node=<optimized out>, node=<optimized out>) at
../lib/json.c:1583
#4  0x5596879c0132 in json_serialize_object (s=0x7ffc17013bf0,
object=0x55968993dcb0) at ../lib/json.c:1612
#5  json_serialize (json=<optimized out>, s=0x7ffc17013bf0) at
../lib/json.c:1533
#6  0x5596879c249c in json_to_ds (json=json@entry=0x559689950670,
flags=flags@entry=0, ds=ds@entry=0x7ffc17013c80) at ../lib/json.c:1511
#7  0x5596879ae8df in ovsdb_log_compose_record
(json=json@entry=0x559689950670,
magic=0x55968993dc60 "CLUSTER", header=header@entry=0x7ffc17013c60,
data=data@entry=0x7ffc17013c80) at ../ovsdb/log.c:570
#8  0x5596879aebbf in ovsdb_log_write (file=0x5596899b5df0,
json=0x559689950670) at ../ovsdb/log.c:618
#9  0x5596879aed3e in ovsdb_log_write_and_free
(log=log@entry=0x5596899b5df0,
json=0x559689950670) at ../ovsdb/log.c:651
#10 0x5596879b0954 in raft_write_snapshot (raft=raft@entry=0x5596899151a0,
log=0x5596899b5df0, new_log_start=new_log_start@entry=166,
new_snapshot=new_snapshot@entry=0x7ffc17013e30) at ../ovsdb/raft.c:3588
#11 0x5596879b0ec3 in raft_save_snapshot (raft=raft@entry=0x5596899151a0,
new_start=new_start@entry=166,
new_snapshot=new_snapshot@entry=0x7ffc17013e30)
at ../ovsdb/raft.c:3647
#12 0x5596879b8aed in raft_store_snapshot (raft=0x5596899151a0,
new_snapshot_data=new_snapshot_data@entry=0x5596899505f0) at
../ovsdb/raft.c:3849
#13 0x5596879a579e in ovsdb_storage_store_snapshot__
(storage=0x5596899137a0, schema=0x559689938ca0, data=0x559689946ea0) at
../ovsdb/storage.c:541
#14 0x5596879a625e in ovsdb_storage_store_snapshot
(storage=0x5596899137a0, schema=schema@entry=0x559689938ca0,
data=data@entry=0x559689946ea0)
at ../ovsdb/storage.c:568
#15 0x55968799f5ab in ovsdb_snapshot (db=0x5596899137e0) at
../ovsdb/ovsdb.c:519
#16 0x559687999f23 in ovsdb_server_compact (conn=0x559689938440,
argc=<optimized out>, argv=<optimized out>, dbs_=0x7ffc170141c0) at
../ovsdb/ovsdb-server.c:1443
#17 0x5596879d9cc0 in process_command (request=<optimized out>,
conn=0x559689938440) at ../lib/unixctl.c:315
#18 run_connection (conn=0x559689938440) at ../lib/unixctl.c:349
#19 unixctl_server_run (server=0x559689937370) at ../lib/unixctl.c:400
#20 0x559687996e1e in main_loop (is_backup=0x7ffc1701412e,
exiting=0x7ffc1701412f, run_process=0x0, 

Re: [ovs-dev] [ovs-discuss] ovsdb-server core dump and ovsdb corruption using raft cluster

2018-07-25 Thread Ben Pfaff
On Wed, Jul 18, 2018 at 10:48:08AM -0700, Girish Moodalbail wrote:
> Hello all,
> 
> We are able to reproduce this issue on OVS 2.9.2 at will. The OVSDB NB
> server or OVSDB SB server dumps core while it is trying to compact the
> database.
> 
> You can reproduce the issue by using:
> 
> root@u1804-HVM-domU:/var/crash# ovs-appctl -t
> /var/run/openvswitch/ovnsb_db.ctl ovsdb-server/compact OVN_Southbound
> 
> 2018-07-18T17:34:29Z|1|unixctl|WARN|error communicating with
> unix:/var/run/openvswitch/ovnsb_db.ctl: End of file
> ovs-appctl: /var/run/openvswitch/ovnsb_db.ctl: transaction error (End of
> file)

Hmm.  I've now spent some time playing with clustered OVSDB, in 3-server
and 5-server configurations, and triggering compaction at various points
while starting and stopping servers.  But I haven't yet managed to
trigger this crash.

Is there anything else that seems to be an important element?

Thanks,

Ben.


Re: [ovs-dev] [ovs-discuss] ovsdb-server core dump and ovsdb corruption using raft cluster

2018-07-25 Thread Girish Moodalbail
On one occasion, we saw ovsdb-server crash with a similar stack (i.e.,
aborting at the same location, but via a different code path). This time, the
raft node (leader) was trying to reply to a follower with a snapshot of its
database (I am guessing from the stack trace, so I might be way off here).
See below for the stack trace.

What I could glean is that struct raft->snap.servers contains some invalid
data. The code is trying to read this memory to add a header to the JSON
data before it sends it out to the follower.

(gdb) ptype struct raft
type = struct raft {
    ...
    struct raft_entry snap;
    ...
}
(gdb) print (*((struct raft *)0x558e29da54a0))->snap
$8 = {
  term = 1,
  data = 0x558e29e0ee10,
  eid = {
    parts = {2565672547, 1297826449, 2363976528, 2929681189}
  },
  servers = 0x558e29dbe7a0   <-- this one is junk
}
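The abort in json_serialize() at lib/json.c:1554 is consistent with that
junk pointer: OVS JSON values are a tagged union, and the serializer's
switch on json->type falls into an unreachable default (an abort) for any
tag outside the valid range, which is exactly what a dangling or
overwritten "struct json *" presents. The sketch below illustrates the
failure mode with simplified, hypothetical types; it is not the OVS
source:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for OVS's tagged JSON union.  In the real
 * lib/json.c, the serializer's switch on json->type ends in
 * OVS_NOT_REACHED() (an abort) for an out-of-range tag, which is what
 * a stale pointer whose memory has been freed or overwritten would
 * typically trigger. */
enum json_type {
    JSON_NULL, JSON_FALSE, JSON_TRUE, JSON_OBJECT,
    JSON_ARRAY, JSON_INTEGER, JSON_REAL, JSON_STRING,
    JSON_N_TYPES
};

struct json {
    enum json_type type;
};

/* Returns true if the tag is one a serializer could handle; the real
 * serializer aborts instead of returning false.  The cast makes the
 * range check safe even if garbage memory yields a negative value. */
static bool
json_tag_is_valid(const struct json *json)
{
    return (unsigned int) json->type < JSON_N_TYPES;
}
```

A pointer into freed memory, like the `servers` value above, would hand
this check (and the real serializer) a garbage tag.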

[Stack Trace below]

$ gdb /usr/sbin/ovsdb-server core
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Reading symbols from /usr/sbin/ovsdb-server...Reading symbols from
/usr/lib/debug/.build-id/4f/0be3920e0ce0ed5603c301f294c3d36392a187.debug...done.
done.
[New LWP 2832]
[New LWP 2834]
[New LWP 3933]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `ovsdb-server -vconsole:off -vfile:info
--log-file=/var/log/openvswitch/ovsdb-se'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f0791af2ec0 (LWP 2832))]
(gdb)
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x7f0790c83801 in __GI_abort () at abort.c:79
#2  0x7f07912ed17c in json_serialize (json=<optimized out>,
s=<optimized out>) at lib/json.c:1554
#3  0x7f07912ed1eb in json_serialize_object_member (i=<optimized out>,
s=<optimized out>, node=<optimized out>,
node=<optimized out>) at lib/json.c:1583
#4  0x7f07912ed132 in json_serialize_object (s=0x7ffc22096eb0,
object=0x558e29da4120) at lib/json.c:1612
#5  json_serialize (json=<optimized out>, s=0x7ffc22096eb0) at
lib/json.c:1533
#6  0x7f07912ecf75 in json_serialize_array (array=0x558e29df12b0,
array=0x558e29df12b0, s=0x7ffc22096eb0) at lib/json.c:1637
#7  json_serialize (json=0x558e29df12a0, s=0x7ffc22096eb0) at
lib/json.c:1537
#8  0x7f07912ed1eb in json_serialize_object_member (i=<optimized out>,
s=<optimized out>, node=<optimized out>,
node=<optimized out>) at lib/json.c:1583
#9  0x7f07912ed132 in json_serialize_object (s=0x7ffc22096eb0,
object=0x558e29e10450) at lib/json.c:1612
#10 json_serialize (json=<optimized out>, s=0x7ffc22096eb0) at
lib/json.c:1533
#11 0x7f07912ef47c in json_to_ds (json=json@entry=0x558e29e046d0,
flags=flags@entry=0, ds=ds@entry=0x7ffc22096ee0)
at lib/json.c:1511
#12 0x7f07912f0658 in jsonrpc_send (rpc=0x558e29e101b0,
msg=<optimized out>) at lib/jsonrpc.c:253
#13 0x7f07912f104e in jsonrpc_session_send (s=<optimized out>,
msg=<optimized out>) at lib/jsonrpc.c:1095
#14 0x7f07916b40ac in raft_send__ (raft=raft@entry=0x558e29da54a0,
rpc=rpc@entry=0x7ffc22096f90,
conn=conn@entry=0x558e29dff500) at ovsdb/raft.c:4009
#15 0x7f07916b45d0 in raft_send (raft=raft@entry=0x558e29da54a0,
rpc=rpc@entry=0x7ffc22096f90) at ovsdb/raft.c:4050
#16 0x7f07916b4c0c in raft_send_install_snapshot_request
(raft=raft@entry=0x558e29da54a0, comment=comment@entry=0x0,
s=<optimized out>) at ovsdb/raft.c:3059
#17 0x7f07916baa7c in raft_handle_append_reply (rpy=0x7ffc22097030,
raft=0x558e29da54a0) at ovsdb/raft.c:3134
#18 raft_handle_rpc (rpc=0x7ffc22097030, raft=0x558e29da54a0) at
ovsdb/raft.c:3987
#19 raft_conn_run (raft=raft@entry=0x558e29da54a0,
conn=conn@entry=0x558e29dff500)
at ovsdb/raft.c:1371
#20 0x7f07916baeeb in raft_run (raft=0x558e29da54a0) at
ovsdb/raft.c:1724
#21 0x558e29781ebd in main_loop (is_backup=0x7ffc220972ae,
exiting=0x7ffc220972af, run_process=0x0, remotes=0x7ffc22097300,
unixctl=0x558e29de9560, all_dbs=0x7ffc22097340, jsonrpc=0x558e29da5420,
config=0x7ffc22097360) at ovsdb/ovsdb-server.c:230
#22 main (argc=<optimized out>, argv=<optimized out>) at
ovsdb/ovsdb-server.c:457
(gdb) print *((struct shash *)0x558e29da4120)
$1 = {map = {buckets = 0x558e29e04480, one = 0x0, mask = 7, n = 9}}
(gdb) set print pretty on
(gdb) set print elements 0
(gdb) print *((struct shash *)0x558e29da4120)
$2 = {
  map = {
buckets = 

Re: [ovs-dev] [ovs-discuss] ovsdb-server core dump and ovsdb corruption using raft cluster

2018-07-24 Thread aginwala
Hi:

Glad to see more people picking up on raft testing.

Just to add on, you can also refer to
https://mail.openvswitch.org/pipermail/ovs-dev/2018-May/347765.html and
https://mail.openvswitch.org/pipermail/ovs-dev/2018-April/346375.html, where
a couple of suggestions were given by Ben too. See if you can skip the
snapshot code and still see the error. However, the ask to skip the snapshot
was to see whether performance would improve, for testing purposes. I
remember tuning my VM memory, vcpus, etc., and I never ran into the core
dump issue again.



Regards,


On Tue, Jul 24, 2018 at 4:41 PM Yifeng Sun  wrote:

> [quoted text trimmed; it repeats Yifeng's 24 July messages and Girish's
> 18 July report, which appear elsewhere in this thread.]

Re: [ovs-dev] [ovs-discuss] ovsdb-server core dump and ovsdb corruption using raft cluster

2018-07-24 Thread Yifeng Sun
My apologies, the patch has an issue. I need to dig further.

Yifeng

On Tue, Jul 24, 2018 at 1:40 PM, Yifeng Sun  wrote:

> Hi Yun and Girish,
>
> I submitted a patch, do you mind testing and reviewing it? Thanks.
>
> [PATCH] dynamic-string: Fix a bug that leads to assertion fail
>
> diff --git a/lib/dynamic-string.c b/lib/dynamic-string.c
> index 6f7b610a9908..4564e420544d 100644
> --- a/lib/dynamic-string.c
> +++ b/lib/dynamic-string.c
> @@ -158,7 +158,7 @@ ds_put_format_valist(struct ds *ds, const char
> *format, va_list args_)
>  if (needed < available) {
>  ds->length += needed;
>  } else {
> -ds_reserve(ds, ds->length + needed);
> +ds_reserve(ds, ds->allocated + needed);
>
>  va_copy(args, args_);
>  available = ds->allocated - ds->length + 1;
>
>
> Thanks,
> Yifeng Sun
>
> On Wed, Jul 18, 2018 at 10:48 AM, Girish Moodalbail  wrote:
>
>> Hello all,
>>
>> We are able to reproduce this issue on OVS 2.9.2 at will. The OVSDB NB
>> server or OVSDB SB server dumps core while it is trying to compact the
>> database.
>>
>> You can reproduce the issue by using:
>>
>> root@u1804-HVM-domU:/var/crash# ovs-appctl -t
>> /var/run/openvswitch/ovnsb_db.ctl ovsdb-server/compact OVN_Southbound
>>
>> 2018-07-18T17:34:29Z|1|unixctl|WARN|error communicating with
>> unix:/var/run/openvswitch/ovnsb_db.ctl: End of file
>> ovs-appctl: /var/run/openvswitch/ovnsb_db.ctl: transaction error (End of
>> file)
>> root@u1804-HVM-domU:/var/crash#
>> root@u1804-HVM-domU:/var/crash#
>> root@u1804-HVM-domU:/var/crash# ERROR: apport (pid 17393) Wed Jul 18
>> 10:34:23 2018: called for pid 14683, signal 6, core limit 0, dump mode 1
>> ERROR: apport (pid 17393) Wed Jul 18 10:34:23 2018: executable:
>> /usr/sbin/ovsdb-server (command line "ovsdb-server -vconsole:off
>> -vfile:info --log-file=/var/log/openvswitch/ovsdb-server-sb.log
>> --remote=punix:/var/run/openvswitch/ovnsb_db.sock
>> --pidfile=/var/run/openvswitch/ovnsb_db.pid --unixctl=ovnsb_db.ctl
>> --detach
>> --monitor --remote=db:OVN_Southbound,SB_Global,connections
>> --private-key=db:OVN_Southbound,SSL,private_key
>> --certificate=db:OVN_Southbound,SSL,certificate
>> --ca-cert=db:OVN_Southbound,SSL,ca_cert
>> --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols
>> --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers
>> --remote=ptcp:6642:10.0.7.33 /etc/openvswitch/ovnsb_db.db")
>> ERROR: apport (pid 17393) Wed Jul 18 10:34:23 2018: is_closing_session():
>> no DBUS_SESSION_BUS_ADDRESS in environment
>> ERROR: apport (pid 17393) Wed Jul 18 10:34:29 2018: wrote report
>> /var/crash/_usr_sbin_ovsdb-server.0.crash
>>
>> Looking through the crash we see the following stack:
>>
>> (gdb) bt
>> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
>> #1  0x7f7c9a43c801 in __GI_abort () at abort.c:79
>> #2  0x7f7c9aaa633c in json_serialize (json=,
>> s=) at lib/json.c:1554
>> #3  0x7f7c9aaa63ab in json_serialize_object_member (i=,
>> s=, node=, node=)
>> at lib/json.c:1583
>> #4  0x7f7c9aaa62f2 in json_serialize_object (s=0x7ffca2173ea0,
>> object=0x5568dc5d5b10) at lib/json.c:1612
>> #5  json_serialize (json=, s=0x7ffca2173ea0) at
>> lib/json.c:1533
>> #6  0x7f7c9aaa863c in json_to_ds (json=json@entry=0x5568dc5d4a20,
>> flags=flags@entry=0, ds=ds@entry=0x7ffca2173f30) at lib/json.c:1511
>> #7  0x7f7c9ae6750f in ovsdb_log_compose_record
>> (json=json@entry=0x5568dc5d4a20,
>> magic=0x5568dc5d5a60 "CLUSTER",
>> header=header@entry=0x7ffca2173f10, data=data@entry=0x7ffca2173f30)
>> at
>> ovsdb/log.c:570
>> #8  0x7f7c9ae677ef in ovsdb_log_write (file=0x5568dc5d5a80,
>> json=0x5568dc5d4a20) at ovsdb/log.c:618
>> #9  0x7f7c9ae6796e in ovsdb_log_write_and_free
>> (log=log@entry=0x5568dc5d5a80,
>> json=0x5568dc5d4a20) at ovsdb/log.c:651
>> #10 0x7f7c9ae6d684 in raft_write_snapshot (raft=raft@entry
>> =0x5568dc1e3720,
>> log=0x5568dc5d5a80, new_log_start=new_log_start@entry=539578,
>> new_snapshot=new_snapshot@entry=0x7ffca21740e0) at ovsdb/raft.c:3588
>> #11 0x7f7c9ae6dbf3 in raft_save_snapshot (raft=raft@entry
>> =0x5568dc1e3720,
>> new_start=new_start@entry=539578,
>> new_snapshot=new_snapshot@entry=0x7ffca21740e0) at ovsdb/raft.c:3647
>> #12 0x7f7c9ae757bd in raft_store_snapshot (raft=0x5568dc1e3720,
>> new_snapshot_data=new_snapshot_data@entry=0x5568dc5d49a0)
>> at ovsdb/raft.c:3849
>> #13 0x7f7c9ae7c7ae in ovsdb_storage_store_snapshot__
>> (storage=0x5568dc6b2fb0, schema=0x5568dd66f5a0, data=0x5568dca67880)
>> at ovsdb/storage.c:541
>> #14 0x7f7c9ae7d1de in ovsdb_storage_store_snapshot
>> (storage=0x5568dc6b2fb0, schema=schema@entry=0x5568dd66f5a0,
>> data=data@entry=0x5568dca67880) at ovsdb/storage.c:568
>> #15 0x7f7c9ae69cab in ovsdb_snapshot (db=0x5568dc6b3020) at
>> ovsdb/ovsdb.c:519
>> #16 0x5568daec1f82 in main_loop (is_backup=0x7ffca21742be,
>> exiting=0x7ffca21742bf, run_process=0x0, remotes=0x7ffca2174310,
>> 

Re: [ovs-dev] [ovs-discuss] ovsdb-server core dump and ovsdb corruption using raft cluster

2018-07-24 Thread Yifeng Sun
Hi Yun and Girish,

I submitted a patch, do you mind testing and reviewing it? Thanks.

[PATCH] dynamic-string: Fix a bug that leads to assertion fail

diff --git a/lib/dynamic-string.c b/lib/dynamic-string.c
index 6f7b610a9908..4564e420544d 100644
--- a/lib/dynamic-string.c
+++ b/lib/dynamic-string.c
@@ -158,7 +158,7 @@ ds_put_format_valist(struct ds *ds, const char *format, va_list args_)
 if (needed < available) {
 ds->length += needed;
 } else {
-ds_reserve(ds, ds->length + needed);
+ds_reserve(ds, ds->allocated + needed);

 va_copy(args, args_);
 available = ds->allocated - ds->length + 1;


Thanks,
Yifeng Sun
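
For context, the function being patched implements the classic vsnprintf probe-and-retry pattern: format once into whatever free space is left, and if the output was truncated, grow the buffer and format again. A minimal self-contained sketch follows; the field and function names mirror OVS's lib/dynamic-string.c, but the growth policy and error handling here are simplified, so treat it as an illustration rather than the real implementation:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Minimal stand-in for OVS's struct ds (lib/dynamic-string.h). */
struct ds {
    char *string;      /* null-terminated buffer, or NULL when empty */
    size_t length;     /* bytes in use, excluding the terminator */
    size_t allocated;  /* bytes allocated, excluding the terminator */
};

/* Ensure at least min_length bytes (plus terminator) are allocated.
 * Simplified growth policy: at least double, or jump to min_length. */
static void ds_reserve(struct ds *ds, size_t min_length)
{
    if (min_length > ds->allocated || !ds->string) {
        ds->allocated += min_length > ds->allocated ? min_length
                                                    : ds->allocated;
        ds->string = realloc(ds->string, ds->allocated + 1);
        assert(ds->string);
    }
}

/* Probe-and-retry formatting, as in ds_put_format_valist(). */
static void ds_put_format(struct ds *ds, const char *format, ...)
{
    va_list args;

    /* First attempt: write into the free space (possibly none). */
    va_start(args, format);
    size_t available = ds->string ? ds->allocated - ds->length + 1 : 0;
    int needed = vsnprintf(ds->string ? ds->string + ds->length : NULL,
                           available, format, args);
    va_end(args);

    if ((size_t) needed < available) {
        ds->length += needed;                 /* it fit on the first try */
    } else {
        /* Truncated: reserve room for the full result and retry once. */
        ds_reserve(ds, ds->length + needed);
        va_start(args, format);
        available = ds->allocated - ds->length + 1;
        needed = vsnprintf(ds->string + ds->length, available, format, args);
        va_end(args);
        assert((size_t) needed < available);  /* the assertion that fires */
        ds->length += needed;
    }
}
```

The patch above changes only the size passed to ds_reserve() on the retry path; when the retried buffer is still too small, the final assertion (ovs_assert in the real code) aborts the process, which matches the "assertion fail" in the patch subject.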


Re: [ovs-dev] [ovs-discuss] ovsdb-server core dump and ovsdb corruption using raft cluster

2018-07-18 Thread Girish Moodalbail
Hello all,

We are able to reproduce this issue on OVS 2.9.2 at will. The OVSDB NB
server or OVSDB SB server dumps core while it is trying to compact the
database.

You can reproduce the issue by using:

root@u1804-HVM-domU:/var/crash# ovs-appctl -t
/var/run/openvswitch/ovnsb_db.ctl ovsdb-server/compact OVN_Southbound

2018-07-18T17:34:29Z|1|unixctl|WARN|error communicating with
unix:/var/run/openvswitch/ovnsb_db.ctl: End of file
ovs-appctl: /var/run/openvswitch/ovnsb_db.ctl: transaction error (End of
file)
root@u1804-HVM-domU:/var/crash#
root@u1804-HVM-domU:/var/crash#
root@u1804-HVM-domU:/var/crash# ERROR: apport (pid 17393) Wed Jul 18
10:34:23 2018: called for pid 14683, signal 6, core limit 0, dump mode 1
ERROR: apport (pid 17393) Wed Jul 18 10:34:23 2018: executable:
/usr/sbin/ovsdb-server (command line "ovsdb-server -vconsole:off
-vfile:info --log-file=/var/log/openvswitch/ovsdb-server-sb.log
--remote=punix:/var/run/openvswitch/ovnsb_db.sock
--pidfile=/var/run/openvswitch/ovnsb_db.pid --unixctl=ovnsb_db.ctl --detach
--monitor --remote=db:OVN_Southbound,SB_Global,connections
--private-key=db:OVN_Southbound,SSL,private_key
--certificate=db:OVN_Southbound,SSL,certificate
--ca-cert=db:OVN_Southbound,SSL,ca_cert
--ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols
--ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers
--remote=ptcp:6642:10.0.7.33 /etc/openvswitch/ovnsb_db.db")
ERROR: apport (pid 17393) Wed Jul 18 10:34:23 2018: is_closing_session():
no DBUS_SESSION_BUS_ADDRESS in environment
ERROR: apport (pid 17393) Wed Jul 18 10:34:29 2018: wrote report
/var/crash/_usr_sbin_ovsdb-server.0.crash

Looking through the crash we see the following stack:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x7f7c9a43c801 in __GI_abort () at abort.c:79
#2  0x7f7c9aaa633c in json_serialize (json=,
s=) at lib/json.c:1554
#3  0x7f7c9aaa63ab in json_serialize_object_member (i=,
s=, node=, node=)
at lib/json.c:1583
#4  0x7f7c9aaa62f2 in json_serialize_object (s=0x7ffca2173ea0,
object=0x5568dc5d5b10) at lib/json.c:1612
#5  json_serialize (json=, s=0x7ffca2173ea0) at
lib/json.c:1533
#6  0x7f7c9aaa863c in json_to_ds (json=json@entry=0x5568dc5d4a20,
flags=flags@entry=0, ds=ds@entry=0x7ffca2173f30) at lib/json.c:1511
#7  0x7f7c9ae6750f in ovsdb_log_compose_record
(json=json@entry=0x5568dc5d4a20,
magic=0x5568dc5d5a60 "CLUSTER",
header=header@entry=0x7ffca2173f10, data=data@entry=0x7ffca2173f30) at
ovsdb/log.c:570
#8  0x7f7c9ae677ef in ovsdb_log_write (file=0x5568dc5d5a80,
json=0x5568dc5d4a20) at ovsdb/log.c:618
#9  0x7f7c9ae6796e in ovsdb_log_write_and_free
(log=log@entry=0x5568dc5d5a80,
json=0x5568dc5d4a20) at ovsdb/log.c:651
#10 0x7f7c9ae6d684 in raft_write_snapshot (raft=raft@entry=0x5568dc1e3720,
log=0x5568dc5d5a80, new_log_start=new_log_start@entry=539578,
new_snapshot=new_snapshot@entry=0x7ffca21740e0) at ovsdb/raft.c:3588
#11 0x7f7c9ae6dbf3 in raft_save_snapshot (raft=raft@entry=0x5568dc1e3720,
new_start=new_start@entry=539578,
new_snapshot=new_snapshot@entry=0x7ffca21740e0) at ovsdb/raft.c:3647
#12 0x7f7c9ae757bd in raft_store_snapshot (raft=0x5568dc1e3720,
new_snapshot_data=new_snapshot_data@entry=0x5568dc5d49a0)
at ovsdb/raft.c:3849
#13 0x7f7c9ae7c7ae in ovsdb_storage_store_snapshot__
(storage=0x5568dc6b2fb0, schema=0x5568dd66f5a0, data=0x5568dca67880)
at ovsdb/storage.c:541
#14 0x7f7c9ae7d1de in ovsdb_storage_store_snapshot
(storage=0x5568dc6b2fb0, schema=schema@entry=0x5568dd66f5a0,
data=data@entry=0x5568dca67880) at ovsdb/storage.c:568
#15 0x7f7c9ae69cab in ovsdb_snapshot (db=0x5568dc6b3020) at
ovsdb/ovsdb.c:519
#16 0x5568daec1f82 in main_loop (is_backup=0x7ffca21742be,
exiting=0x7ffca21742bf, run_process=0x0, remotes=0x7ffca2174310,
unixctl=0x5568dc71ade0, all_dbs=0x7ffca2174350, jsonrpc=0x5568dc1e36a0,
config=0x7ffca2174370) at ovsdb/ovsdb-server.c:239
#17 main (argc=, argv=) at
ovsdb/ovsdb-server.c:457

Walking through the JSON objects being serialized we see that
"prev_servers" is malformed.

(gdb) print *((struct shash *)0x5568dc5d5b10)
$3 = {
  map = {
buckets = 0x5568dc5d1d30,
one = 0x0,
mask = 7,
n = 9
  }
}

(gdb) x/6a 0x5568dc5d1d30
0x5568dc5d1d30: 0x5568dc5d6000  0x0
0x5568dc5d1d40: 0x0             0x5568dc5d5f30
0x5568dc5d1d50: 0x5568dc5d5e30  0x5568dc5d5bc0

Let us look at the next one

(gdb) print *((struct shash_node *)0x5568dc5d5e30)
$7 = {
  node = {
hash = 2043875868,
next = 0x0
  },
  name = 0x5568dc5d5e10 "prev_servers",
  data = 0x5568dc688cd0
}
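
The bucket array being walked above follows OVS's open-hashing layout (lib/hmap.h wrapped by lib/shash.h). A rough sketch of the structures involved, with illustrative field types rather than the exact OVS declarations:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of OVS's hmap intrusive hash table (lib/hmap.h). */
struct hmap_node {
    unsigned int hash;       /* hash of this node's key */
    struct hmap_node *next;  /* next node in the same bucket, or NULL */
};

struct hmap {
    struct hmap_node **buckets;  /* array of mask + 1 bucket heads */
    struct hmap_node *one;       /* shortcut for single-entry maps */
    size_t mask;                 /* bucket count minus 1 (power of two) */
    size_t n;                    /* number of nodes in the map */
};

/* A node lives in bucket (hash & mask). */
static struct hmap_node *hmap_first_in_bucket(const struct hmap *map,
                                              unsigned int hash)
{
    return map->buckets[hash & map->mask];
}
```

With mask = 7 there are eight buckets, and the node above, whose hash is 2043875868, lands in bucket 2043875868 & 7 = 4, which is exactly the fifth pointer in the x/6a dump (0x5568dc5d5e30). So the bucket chain itself is consistent; it is the node's payload that is damaged.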

(gdb) print *((struct json *)0x5568dc688cd0)
$10 = {
  type = 3697839232,
  count = 34,
  u = {
object = 0x5568dc688cb0,
array = {
  n = 93908862799024,
  n_allocated = 93908862798944,
  elems = 0x5568dc22f050
},
integer = 93908862799024,
real = 4.6397142949016804e-310,
string = 0x5568dc688cb0 "\a"
  }
}
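
The type field is the giveaway: in OVS's struct json (lib/json.h), type is a small enum, and 3697839232 is far outside its range, so json_serialize()'s switch on the tag has no matching case and aborts (presumably via OVS_NOT_REACHED()), matching frame #2 of the backtrace. A sketch of the check; the enum values are from OVS 2.9's lib/json.h, while the helper function is illustrative, not OVS API:

```c
#include <stdbool.h>

/* The json_type values OVS defines in lib/json.h (order matters). */
enum json_type {
    JSON_NULL, JSON_FALSE, JSON_TRUE, JSON_OBJECT, JSON_ARRAY,
    JSON_INTEGER, JSON_REAL, JSON_STRING, JSON_N_TYPES
};

/* A well-formed node's tag is always below JSON_N_TYPES; the corrupted
 * node's tag (3697839232) is not, which is what trips the serializer. */
static bool json_type_is_valid(unsigned long long type)
{
    return type < JSON_N_TYPES;
}
```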

So, this is malformed. Somehow "prev_servers" is getting