Hi Florin,

If I repeat that test exactly as you ran it, I see the same results as you did. With a slight modification, the situation I described shows up:

1. systemctl start vpp
2. start vat, execute sw_interface_dump
3. leave vat running, in another terminal run systemctl restart vpp
4. in still-running vat, execute ip_address_dump ipv4 sw_if_index 1
5. quit vat
6. start vat

Basically, get vat to send a message after vpp has been restarted.

Step 4 shows this error and then the vat prompt returns:

ip_address_dump error: Misc

Step 5 shows this and returns me to the shell:

main:446: BUG: message reply spin-wait timeout
vl_client_disconnect:301: peer unresponsive, give up

Step 6 hangs for a couple of minutes and then prints:

vl_map_shmem:639: region init fail
connect_to_vlib_internal:398: vl_client_api map rv -2
Couldn't connect to vpe, exiting…

Are you able to reproduce this?

Thanks!
-Matt

> On Jan 26, 2018, at 4:54 PM, Florin Coras <fcoras.li...@gmail.com> wrote:
>
> Hi Matt,
>
> I tried reproducing this with vpp + vat. Is this a fair equivalent scenario?
>
> 1. Start vpp and attach vpp_api_test and send some msg
> 2. Restart vpp and stop vat
> 3. Restart vat and send message.
>
> The thing is, off of master, this works for me.
>
> Thanks,
> Florin
>
>> On Jan 26, 2018, at 2:31 PM, Matthew Smith <mgsm...@netgate.com> wrote:
>>
>>
>> Hi all,
>>
>> I have a few applications that use the shared memory API. I’m running these
>> on CentOS 7.4, and starting VPP using systemd. If VPP happens to crash or be
>> intentionally restarted, those applications never seem to recover their API
>> connection. They notice that the original VPP process died and try to call
>> vl_client_disconnect_from_vlib(). That call tries to send API messages to
>> cleanly shut down its connection. The application will time out waiting for
>> a response, write a message like:
>>
>> vl_client_disconnect:301: peer unresponsive, give up
>>
>> and eventually consider itself disconnected.
>> When it tries to reconnect, it
>> hangs for a while (100 seconds on the last occurrence I checked on) and then
>> prints messages like:
>>
>> vl_map_shmem:619: region init fail
>> connect_to_vlib_internal:394: vl_client_api map rv -2
>>
>> The client keeps on trying and continues seeing those same errors. If the
>> client is restarted, it sees the same errors after restart. It doesn’t
>> recover until VPP is restarted with the client stopped. Once that happens,
>> the client can be started again and successfully connect.
>>
>> The VPP systemd service file that is installed with RPMs built via ‘make
>> pkg-rpm’ has the following:
>>
>> [Service]
>> ExecStartPre=-/bin/rm -f /dev/shm/db /dev/shm/global_vm /dev/shm/vpe-api
>>
>> When systemd starts VPP, it removes these files which the still-running
>> client applications have run shm_open/mmap on. I am guessing that when those
>> clients try to disconnect with vl_client_disconnect_from_vlib(), they are
>> stomping on something in shared memory that subsequently keeps them from
>> being able to connect. If I comment that command from the systemd service
>> definition, the problem behavior I described above disappears. The
>> applications write one ‘peer unresponsive’ message and then they reconnect
>> to the API successfully and all is (relatively) well. This also is the case
>> if I don’t start VPP with systemd/systemctl and just run /usr/bin/vpp
>> directly.
>>
>> Does anyone have any thoughts on whether it would be ok to remove that
>> command from the systemd service file? Or is there some other better way to
>> deal with VPP crashing from the perspective of a client to the shared memory
>> API?
>>
>> Thanks!
>> -Matt
>>
>> _______________________________________________
>> vpp-dev mailing list
>> vpp-dev@lists.fd.io
>> https://lists.fd.io/mailman/listinfo/vpp-dev
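
The original message quoted above notes that the ExecStartPre rm removes files that still-running clients have already mapped. A minimal sketch of the generic POSIX behavior involved, assuming a Unix system: unlinking a mapped file removes only the name, so the old mapping stays usable, and a file re-created under the same name is a separate object. This does not use the VPP API and does not show what vl_client_disconnect_from_vlib() actually writes. The file name and the function demo_unlink_while_mapped are hypothetical, and a temporary directory stands in for /dev/shm.

```python
import mmap
import os
import tempfile


def demo_unlink_while_mapped():
    """Map a file, unlink it, re-create it, and compare the two regions."""
    workdir = tempfile.mkdtemp()
    # Hypothetical stand-in for a segment such as /dev/shm/vpe-api.
    path = os.path.join(workdir, "vpe-api-demo")

    # The "old server" creates the segment and the "client" maps it.
    fd_old = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    os.ftruncate(fd_old, 4096)
    client_map = mmap.mmap(fd_old, 4096)  # shared mapping by default on Unix
    inode_old = os.fstat(fd_old).st_ino

    # Equivalent of "rm -f": the name goes away, the mapping does not.
    os.unlink(path)

    # The "new server" creates a fresh file under the same name.
    fd_new = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    os.ftruncate(fd_new, 4096)
    server_map = mmap.mmap(fd_new, 4096)
    server_map[:3] = b"new"
    inode_new = os.fstat(fd_new).st_ino

    # The client still writes into the old, now nameless, region.
    client_map[:5] = b"stale"

    result = {
        "client_sees": bytes(client_map[:5]),
        "server_sees": bytes(server_map[:3]),
        "same_inode": inode_old == inode_new,
    }

    # Clean up mappings, descriptors, and the temporary directory.
    client_map.close()
    server_map.close()
    os.close(fd_old)
    os.close(fd_new)
    os.unlink(path)
    os.rmdir(workdir)
    return result


if __name__ == "__main__":
    print(demo_unlink_while_mapped())
```

In this sketch the client's write never reaches the new file, so it illustrates only that old clients and a restarted server end up on different regions. Whether that is what keeps the clients from reconnecting remains the guess stated in the quoted message.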