Hi Florin,

I rebuilt VPP with your patch applied, and it looks like it works. I restarted VPP while one of my applications was connected to the API, then created some activity that would force the application to send a message to the API. It logged a timeout on sending the message, disconnected, and then reconnected successfully. The connection worked properly after that.
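For anyone following along, the client-side behavior described above (timeout on send, disconnect, reconnect, carry on) can be sketched roughly as below. This is only an illustration in Python, not the VPP client library: ApiTimeout, send_with_reconnect, and the client object with send/connect/disconnect methods are all hypothetical names. A real application would go through the shared memory API calls mentioned later in this thread, such as vl_client_disconnect_from_vlib().

```python
import time


class ApiTimeout(Exception):
    """Raised by the hypothetical client when a send gets no reply in time."""


def send_with_reconnect(client, message, retries=3, delay=1.0):
    """Send a message; on timeout, drop the connection, reconnect, and retry.

    `client` is a hypothetical object exposing send(), disconnect(), and
    connect(). It stands in for whatever wrapper an application has around
    the real API.
    """
    for attempt in range(retries + 1):
        try:
            return client.send(message)
        except ApiTimeout:
            if attempt == retries:
                raise  # out of retries, let the caller decide what to do
            client.disconnect()  # the peer may be unresponsive at this point
            time.sleep(delay)    # give the restarted server time to come up
            client.connect()
```

The retry count and delay are arbitrary. The only point is that the disconnect has to be able to give up on an unresponsive peer so that the reconnect can go ahead.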
Thanks!
-Matt

> On Jan 29, 2018, at 3:19 PM, Florin Coras <fcoras.li...@gmail.com> wrote:
>
> Ow, I’m guilty of having “manually restarted” vpp so I completely avoided the
> segment cleanup ...
>
> I can’t yet figure out why, but it seems that doing vl_client_api_unmap when
> vpp does not respond leads to breakage. Could you try the quick fix here [1]
> and see if it fixes your issue?
>
> Cheers,
> Florin
>
> [1] https://gerrit.fd.io/r/#/c/10315/
>
>> On Jan 29, 2018, at 11:26 AM, Matthew Smith <mgsm...@netgate.com> wrote:
>>
>> Hi Florin,
>>
>> If I repeat that test exactly as you ran it, I see the same results as you
>> did. With a slight modification, the situation I described shows up:
>>
>> 1. systemctl start vpp
>> 2. start vat, execute sw_interface_dump
>> 3. leave vat running, in another terminal run systemctl restart vpp
>> 4. in still-running vat, execute ip_address_dump ipv4 sw_if_index 1
>> 5. quit vat
>> 6. start vat
>>
>> Basically, get vat to send a message after vpp has been restarted.
>>
>> Step 4 shows this error and then the vat prompt returns:
>>
>> ip_address_dump error: Misc
>>
>> Step 5 shows this and returns me to the shell:
>>
>> main:446: BUG: message reply spin-wait timeout
>> vl_client_disconnect:301: peer unresponsive, give up
>>
>> Step 6 hangs for a couple of minutes and then prints:
>>
>> vl_map_shmem:639: region init fail
>> connect_to_vlib_internal:398: vl_client_api map rv -2
>> Couldn't connect to vpe, exiting…
>>
>> Are you able to reproduce this?
>>
>> Thanks!
>> -Matt
>>
>>> On Jan 26, 2018, at 4:54 PM, Florin Coras <fcoras.li...@gmail.com> wrote:
>>>
>>> Hi Matt,
>>>
>>> I tried reproducing this with vpp + vat. Is this a fair equivalent scenario?
>>>
>>> 1. Start vpp and attach vpp_api_test and send some msg
>>> 2. Restart vpp and stop vat
>>> 3. Restart vat and send message.
>>>
>>> The thing is, off of master, this works for me.
>>>
>>> Thanks,
>>> Florin
>>>
>>>> On Jan 26, 2018, at 2:31 PM, Matthew Smith <mgsm...@netgate.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I have a few applications that use the shared memory API. I’m running
>>>> these on CentOS 7.4, and starting VPP using systemd. If VPP happens to
>>>> crash or be intentionally restarted, those applications never seem to
>>>> recover their API connection. They notice that the original VPP process
>>>> died and try to call vl_client_disconnect_from_vlib(). That call tries to
>>>> send API messages to cleanly shut down its connection. The application
>>>> will time out waiting for a response, write a message like:
>>>>
>>>> vl_client_disconnect:301: peer unresponsive, give up
>>>>
>>>> and eventually consider itself disconnected. When it tries to reconnect,
>>>> it hangs for a while (100 seconds on the last occurrence I checked on) and
>>>> then prints messages like:
>>>>
>>>> vl_map_shmem:619: region init fail
>>>> connect_to_vlib_internal:394: vl_client_api map rv -2
>>>>
>>>> The client keeps on trying and continues seeing those same errors. If the
>>>> client is restarted, it sees the same errors after restart. It doesn’t
>>>> recover until VPP is restarted with the client stopped. Once that happens,
>>>> the client can be started again and successfully connect.
>>>>
>>>> The VPP systemd service file that is installed with RPMs built via ‘make
>>>> pkg-rpm’ has the following:
>>>>
>>>> [Service]
>>>> ExecStartPre=-/bin/rm -f /dev/shm/db /dev/shm/global_vm /dev/shm/vpe-api
>>>>
>>>> When systemd starts VPP, it removes these files which the still-running
>>>> client applications have run shm_open/mmap on. I am guessing that when
>>>> those clients try to disconnect with vl_client_disconnect_from_vlib(),
>>>> they are stomping on something in shared memory that subsequently keeps
>>>> them from being able to connect.
>>>> If I comment that command out of the systemd service definition, the
>>>> problem behavior I described above disappears. The applications write one
>>>> ‘peer unresponsive’ message and then they reconnect to the API
>>>> successfully and all is (relatively) well. This is also the case if I
>>>> don’t start VPP with systemd/systemctl and just run /usr/bin/vpp directly.
>>>>
>>>> Does anyone have any thoughts on whether it would be ok to remove that
>>>> command from the systemd service file? Or is there some other, better way
>>>> to deal with VPP crashing from the perspective of a client to the shared
>>>> memory API?
>>>>
>>>> Thanks!
>>>> -Matt
>>>>
>>>> _______________________________________________
>>>> vpp-dev mailing list
>>>> vpp-dev@lists.fd.io
>>>> https://lists.fd.io/mailman/listinfo/vpp-dev
>>>
>>
>
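On the point in the original report about the ExecStartPre rm removing /dev/shm files that running clients still have mapped: the general POSIX behavior can be seen in the small sketch below. It uses a temporary directory and a made-up file name (vpe-api-demo) rather than the real /dev/shm segments, and it assumes a POSIX system. It shows only that a mapping made before the file is removed keeps pointing at the old object, so the client never sees the recreated file. It does not by itself explain the reconnect hang, whose cause is still unclear per the messages above.

```python
import mmap
import os
import tempfile


def stale_mapping_demo():
    """Show that a mapping made before unlink never sees the recreated file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "vpe-api-demo")  # hypothetical stand-in name

        # The "old server" creates the region and the client maps it.
        with open(path, "wb") as f:
            f.write(b"old-region".ljust(4096, b"\0"))
        fd = os.open(path, os.O_RDWR)
        client_view = mmap.mmap(fd, 4096)  # shared, read/write by default
        os.close(fd)

        # Restart: the file is removed and a new one appears under the same name.
        os.unlink(path)
        with open(path, "wb") as f:
            f.write(b"new-region".ljust(4096, b"\0"))

        # The client's writes land in the old, now-unlinked object.
        client_view[0:10] = b"client-bye"
        client_view.flush()
        seen_by_client = bytes(client_view[0:10])
        client_view.close()

        # The new file is untouched by anything the client did.
        with open(path, "rb") as f:
            seen_by_new_server = f.read(10)

    return seen_by_client, seen_by_new_server
```

Calling stale_mapping_demo() returns the client's view and the new file's first bytes as a pair. The two differ because the client and the "new server" are looking at different objects that merely share a name.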