Re: [vpp-dev] recovering from a crash with the C shared memory API

Matthew Smith Mon, 29 Jan 2018 14:44:29 -0800

Hi Florin,

I rebuilt VPP with your patch applied, and it looks like it works. I restarted
VPP while one of my applications was connected to the API, then generated some
activity that forced the application to send a message to the API. It logged a
timeout on sending the message, then disconnected and reconnected
successfully. The connection worked properly after that.
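The recovery sequence described above (time out, disconnect, then reconnect) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the application's actual code: `reconnect_with_retry`, `connect_fn`, `disconnect_fn` and the `fake_peer` helpers are hypothetical names, and the callbacks stand in for whatever the client really calls, such as vl_client_disconnect_from_vlib() and its connect counterpart.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback types standing in for the real VPP client calls. */
typedef int (*connect_fn) (void *ctx);      /* returns 0 on success */
typedef void (*disconnect_fn) (void *ctx);

/*
 * Tear down the stale connection once, then try to connect up to
 * max_attempts times. Returns the number of connect attempts used on
 * success, or -1 if every attempt failed.
 */
int
reconnect_with_retry (connect_fn do_connect, disconnect_fn do_disconnect,
                      void *ctx, int max_attempts)
{
  assert (do_connect != NULL && max_attempts > 0);

  if (do_disconnect != NULL)
    do_disconnect (ctx);        /* may time out if the peer is gone */

  for (int attempt = 1; attempt <= max_attempts; attempt++)
    {
      if (do_connect (ctx) == 0)
        return attempt;
      /* A real client would sleep or back off here before retrying. */
    }
  return -1;
}

/* Fake peer used only to exercise the sketch: it rejects a fixed
 * number of connect attempts and counts disconnect calls. */
struct fake_peer
{
  int failures_left;
  int disconnects;
};

int
fake_connect (void *ctx)
{
  struct fake_peer *p = ctx;
  if (p->failures_left > 0)
    {
      p->failures_left--;
      return -2;                /* arbitrary non-zero failure code */
    }
  return 0;
}

void
fake_disconnect (void *ctx)
{
  struct fake_peer *p = ctx;
  p->disconnects++;
}
```

With a fake peer that rejects the first two attempts, a call with max_attempts set to 5 succeeds on the third attempt after exactly one disconnect. Bounding the number of attempts keeps a client from spinning forever if VPP never comes back.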


Thanks!
-Matt



> On Jan 29, 2018, at 3:19 PM, Florin Coras <fcoras.li...@gmail.com> wrote:
> 
> Ow, I’m guilty of having “manually restarted” vpp so I completely avoided the 
> segment cleanup ...
> 
> I can’t yet figure out why, but it seems that doing vl_client_api_unmap when 
> vpp does not respond leads to breakage. Could you try the quick fix here [1] 
> and see if it fixes your issue?
> 
> Cheers, 
> Florin
> 
> [1] https://gerrit.fd.io/r/#/c/10315/
> 
>> On Jan 29, 2018, at 11:26 AM, Matthew Smith <mgsm...@netgate.com> wrote:
>> 
>> Hi Florin,
>> 
>> If I repeat that test exactly as you ran it, I see the same results as you 
>> did. With a slight modification, the situation I described shows up:
>> 
>> 1. systemctl start vpp
>> 2. start vat, execute sw_interface_dump
>> 3. leave vat running, in another terminal run systemctl restart vpp
>> 4. in still-running vat, execute ip_address_dump ipv4 sw_if_index 1
>> 5. quit vat
>> 6. start vat
>> 
>> Basically, get vat to send a message after vpp has been restarted.
>> 
>> Step 4 shows this error and then the vat prompt returns:
>> 
>> ip_address_dump error: Misc
>> 
>> Step 5 shows this and returns me to the shell:
>> 
>> main:446: BUG: message reply spin-wait timeout
>> vl_client_disconnect:301: peer unresponsive, give up
>> 
>> Step 6 hangs for a couple of minutes and then prints:
>> 
>> vl_map_shmem:639: region init fail
>> connect_to_vlib_internal:398: vl_client_api map rv -2
>> Couldn't connect to vpe, exiting…
>> 
>> 
>> 
>> Are you able to reproduce this?
>> 
>> Thanks!
>> -Matt
>> 
>> 
>> 
>>> On Jan 26, 2018, at 4:54 PM, Florin Coras <fcoras.li...@gmail.com> wrote:
>>> 
>>> Hi Matt, 
>>> 
>>> I tried reproducing this with vpp + vat. Is this a fair equivalent scenario?
>>> 
>>> 1. Start vpp and attach vpp_api_test and send some msg
>>> 2. Restart vpp and stop vat
>>> 3. Restart vat and send message. 
>>> 
>>> The thing is, off of master, this works for me. 
>>> 
>>> Thanks, 
>>> Florin
>>> 
>>>> On Jan 26, 2018, at 2:31 PM, Matthew Smith <mgsm...@netgate.com> wrote:
>>>> 
>>>> 
>>>> Hi all,
>>>> 
>>>> I have a few applications that use the shared memory API. I’m running 
>>>> these on CentOS 7.4, and starting VPP using systemd. If VPP happens to 
>>>> crash or be intentionally restarted, those applications never seem to 
>>>> recover their API connection. They notice that the original VPP process 
>>>> died and try to call vl_client_disconnect_from_vlib(). That call tries to 
>>>> send API messages to cleanly shut down its connection. The application 
>>>> will time out waiting for a response, write a message like:
>>>> 
>>>> vl_client_disconnect:301: peer unresponsive, give up
>>>> 
>>>> and eventually consider itself disconnected. When it tries to reconnect, 
>>>> it hangs for a while (100 seconds on the last occurrence I checked on) and 
>>>> then prints messages like:
>>>> 
>>>> vl_map_shmem:619: region init fail
>>>> connect_to_vlib_internal:394: vl_client_api map rv -2
>>>> 
>>>> The client keeps on trying and continues seeing those same errors. If the 
>>>> client is restarted, it sees the same errors after restart. It doesn’t 
>>>> recover until VPP is restarted with the client stopped. Once that happens, 
>>>> the client can be started again and successfully connect.
>>>> 
>>>> The VPP systemd service file that is installed with RPMs built via
>>>> ‘make pkg-rpm’ has the following:
>>>> 
>>>> [Service]
>>>> ExecStartPre=-/bin/rm -f /dev/shm/db /dev/shm/global_vm /dev/shm/vpe-api
>>>> 
>>>> When systemd starts VPP, it removes these files which the still-running 
>>>> client applications have run shm_open/mmap on. I am guessing that when 
>>>> those clients try to disconnect with vl_client_disconnect_from_vlib(), 
>>>> they are stomping on something in shared memory that subsequently keeps 
>>>> them from being able to connect. If I comment out that command in the
>>>> systemd service definition, the problem behavior I described above
>>>> disappears. The applications write one ‘peer unresponsive’ message and 
>>>> then they reconnect to the API successfully and all is (relatively) well. 
>>>> This also is the case if I don’t start VPP with systemd/systemctl and just 
>>>> run /usr/bin/vpp directly.
>>>> 
>>>> Does anyone have any thoughts on whether it would be ok to remove that 
>>>> command from the systemd service file? Or is there some other better way 
>>>> to deal with VPP crashing from the perspective of a client to the shared 
>>>> memory API?
>>>> 
>>>> Thanks!
>>>> -Matt
>>>> 
>>>> _______________________________________________
>>>> vpp-dev mailing list
>>>> vpp-dev@lists.fd.io <mailto:vpp-dev@lists.fd.io>
>>>> https://lists.fd.io/mailman/listinfo/vpp-dev
>>> 
>> 
>
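The original report in this thread notes that the systemd unit removes files under /dev/shm that running clients have already mapped. The sketch below illustrates only the general POSIX behaviour involved: a mapping made before a file is unlinked keeps referring to the old object, even after a new file is created at the same path. It does not reproduce or explain the VPP failure itself. The function name and the temporary path are hypothetical, and an ordinary file under /tmp stands in for the shared memory segment.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a mapping made before unlink + recreate does NOT see data
 * written through the recreated file, 0 if it does, -1 on a setup error.
 */
int
old_mapping_is_stale_after_unlink (void)
{
  /* Hypothetical stand-in for a segment file such as /dev/shm/vpe-api. */
  char path[] = "/tmp/shm_demo_XXXXXX";
  const size_t len = 4096;
  int result = -1;

  /* "First server instance": create and size the region file. */
  int fd_old = mkstemp (path);
  if (fd_old < 0)
    return -1;
  if (ftruncate (fd_old, (off_t) len) != 0)
    {
      close (fd_old);
      unlink (path);
      return -1;
    }

  /* "Client": map the region while the first instance is alive. */
  char *client_view =
    mmap (NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd_old, 0);
  if (client_view == MAP_FAILED)
    {
      close (fd_old);
      unlink (path);
      return -1;
    }
  strcpy (client_view, "old");

  /* "Service restart": remove the file, then recreate it at the same path. */
  unlink (path);
  int fd_new = open (path, O_CREAT | O_EXCL | O_RDWR, 0600);
  if (fd_new >= 0 && ftruncate (fd_new, (off_t) len) == 0)
    {
      char *server_view =
        mmap (NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd_new, 0);
      if (server_view != MAP_FAILED)
        {
          strcpy (server_view, "new");
          /* The client's mapping still points at the unlinked object. */
          result = (strcmp (client_view, "old") == 0);
          munmap (server_view, len);
        }
    }

  /* Clean up both generations of the file. */
  if (fd_new >= 0)
    {
      close (fd_new);
      unlink (path);
    }
  munmap (client_view, len);
  close (fd_old);
  return result;
}
```

On a POSIX system the function should return 1: the client's view and the recreated file are two different objects, so a client that wants to talk to the new instance has to unmap and map the path again.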

