Oh, I’m guilty of having “manually restarted” vpp, so I completely avoided the 
segment cleanup ...

I can’t yet figure out why, but it seems that calling vl_client_api_unmap() 
when vpp does not respond leads to breakage. Could you try the quick fix 
here [1] and see if it resolves your issue?
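
Conceptually, the idea is to skip the unmap when the peer never answered the 
disconnect. A minimal sketch of that (not the literal patch; I'm assuming 
vl_client_disconnect() returns non-zero when it gives up):

#include <vlibmemory/api.h>

/* Sketch: only tear down the API segment mapping if the disconnect
 * handshake actually completed. */
static void
safe_disconnect_sketch (void)
{
  if (vl_client_disconnect () == 0)
    vl_client_api_unmap ();   /* peer answered; safe to unmap */
  /* else: vpp is gone; leave the mapping alone instead of touching
   * shared memory state that the next connect will need */
}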

Cheers, 
Florin

[1] https://gerrit.fd.io/r/#/c/10315/

> On Jan 29, 2018, at 11:26 AM, Matthew Smith <mgsm...@netgate.com> wrote:
> 
> Hi Florin,
> 
> If I repeat that test exactly as you ran it, I see the same results as you 
> did. With a slight modification, the situation I described shows up:
> 
> 1. systemctl start vpp
> 2. start vat, execute sw_interface_dump
> 3. leave vat running, in another terminal run systemctl restart vpp
> 4. in still-running vat, execute ip_address_dump ipv4 sw_if_index 1
> 5. quit vat
> 6. start vat
> 
> Basically, get vat to send a message after vpp has been restarted.
> 
> Step 4 shows this error and then the vat prompt returns:
> 
> ip_address_dump error: Misc
> 
> Step 5 shows this and returns me to the shell:
> 
> main:446: BUG: message reply spin-wait timeout
> vl_client_disconnect:301: peer unresponsive, give up
> 
> Step 6 hangs for a couple of minutes and then prints:
> 
> vl_map_shmem:639: region init fail
> connect_to_vlib_internal:398: vl_client_api map rv -2
> Couldn't connect to vpe, exiting…
> 
> Are you able to reproduce this?
> 
> Thanks!
> -Matt
> 
> 
> 
>> On Jan 26, 2018, at 4:54 PM, Florin Coras <fcoras.li...@gmail.com> wrote:
>> 
>> Hi Matt, 
>> 
>> I tried reproducing this with vpp + vat. Is this a fair equivalent scenario?
>> 
>> 1. Start vpp and attach vpp_api_test and send some msg
>> 2. Restart vpp and stop vat
>> 3. Restart vat and send a message.
>> 
>> The thing is, off of master, this works for me. 
>> 
>> Thanks, 
>> Florin
>> 
>>> On Jan 26, 2018, at 2:31 PM, Matthew Smith <mgsm...@netgate.com> wrote:
>>> 
>>> 
>>> Hi all,
>>> 
>>> I have a few applications that use the shared memory API. I’m running these 
>>> on CentOS 7.4, and starting VPP using systemd. If VPP happens to crash or 
>>> be intentionally restarted, those applications never seem to recover their 
>>> API connection. They notice that the original VPP process died and call 
>>> vl_client_disconnect_from_vlib(). That call tries to send API messages to 
>>> cleanly shut down the connection. The application will time out waiting 
>>> for a response, write a message like:
>>> 
>>> vl_client_disconnect:301: peer unresponsive, give up
>>> 
>>> and eventually consider itself disconnected. When it tries to reconnect, it 
>>> hangs for a while (100 seconds on the last occurrence I checked on) and 
>>> then prints messages like:
>>> 
>>> vl_map_shmem:619: region init fail
>>> connect_to_vlib_internal:394: vl_client_api map rv -2
>>> 
>>> The client keeps on trying and continues seeing those same errors. If the 
>>> client is restarted, it sees the same errors after restart. It doesn’t 
>>> recover until VPP is restarted with the client stopped. Once that happens, 
>>> the client can be started again and successfully connect.
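>>> 
>>> For reference, the reconnect path in these clients is roughly this shape 
>>> (a simplified sketch; the region name and queue depth here are 
>>> illustrative, not necessarily our exact values):
>>> 
>>> #include <vlibmemory/api.h>
>>> 
>>> /* Sketch of a client reconnect attempt. vl_client_disconnect_from_vlib()
>>>  * may log "peer unresponsive, give up" if vpp already died. */
>>> static int
>>> reconnect_to_vpp (const char *app_name)
>>> {
>>>   vl_client_disconnect_from_vlib ();
>>> 
>>>   /* re-maps the API segment and registers with the (new) vpp; in the
>>>    * broken case this is what ends up failing in vl_map_shmem () with
>>>    * "region init fail" */
>>>   return vl_client_connect_to_vlib ("/vpe-api", app_name, 32);
>>> }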
>>> 
>>> The VPP systemd service file that is installed with RPMs built via ‘make 
>>> pkg-rpm’ has the following:
>>> 
>>> [Service]
>>> ExecStartPre=-/bin/rm -f /dev/shm/db /dev/shm/global_vm /dev/shm/vpe-api
>>> 
>>> When systemd starts VPP, it removes these files, which the still-running 
>>> client applications have called shm_open()/mmap() on. I am guessing that 
>>> when those clients try to disconnect with vl_client_disconnect_from_vlib(), 
>>> they stomp on something in shared memory that subsequently keeps them from 
>>> being able to connect. If I comment out that command in the systemd service 
>>> definition, the problem behavior I described above disappears. The 
>>> applications write one ‘peer unresponsive’ message, then reconnect to the 
>>> API successfully, and all is (relatively) well. This is also the case if I 
>>> don’t start VPP with systemd/systemctl and just run /usr/bin/vpp directly.
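>>> 
>>> To convince myself of the mechanism, I wrote the small standalone program 
>>> below (plain POSIX, not VPP code; it is my assumption of what is 
>>> happening). A mapping taken before the shm file is removed keeps 
>>> referencing the old, orphaned object, so it never sees the re-created 
>>> region of the same name:
>>> 
>>> /* Build with: cc demo.c -o demo -lrt (error checks omitted for brevity) */
>>> #include <fcntl.h>
>>> #include <stdio.h>
>>> #include <string.h>
>>> #include <sys/mman.h>
>>> #include <unistd.h>
>>> 
>>> int
>>> main (void)
>>> {
>>>   /* the client maps the region, as an API client does at connect time */
>>>   int fd = shm_open ("/demo-region", O_CREAT | O_RDWR, 0600);
>>>   ftruncate (fd, 4096);
>>>   char *old_map = mmap (0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>>   close (fd);
>>> 
>>>   /* what the ExecStartPre 'rm -f /dev/shm/...' does on restart */
>>>   shm_unlink ("/demo-region");
>>> 
>>>   /* a restarted vpp creates a fresh region with the same name */
>>>   fd = shm_open ("/demo-region", O_CREAT | O_RDWR, 0600);
>>>   ftruncate (fd, 4096);
>>>   char *new_map = mmap (0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>>   strcpy (new_map, "hello from restarted vpp");
>>> 
>>>   /* the old mapping is still valid memory, but it is a different
>>>    * object now: this prints an empty string, not the new message */
>>>   printf ("old mapping sees: '%s'\n", old_map);
>>>   printf ("new mapping sees: '%s'\n", new_map);
>>>   return 0;
>>> }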
>>> 
>>> Does anyone have any thoughts on whether it would be OK to remove that 
>>> command from the systemd service file? Or is there a better way to deal 
>>> with a VPP crash from the perspective of a shared memory API client?
>>> 
>>> Thanks!
>>> -Matt
>>> 
>> 
> 

_______________________________________________
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev
