Hi Florin,

If I repeat that test exactly as you ran it, I see the same results as you did. 
With a slight modification, the situation I described shows up:

1. systemctl start vpp
2. start vat, execute sw_interface_dump
3. leave vat running, in another terminal run systemctl restart vpp
4. in still-running vat, execute ip_address_dump ipv4 sw_if_index 1
5. quit vat
6. start vat

Basically, get vat to send a message after vpp has been restarted.

Step 4 shows this error and then the vat prompt returns:

ip_address_dump error: Misc

Step 5 shows this and returns me to the shell:

main:446: BUG: message reply spin-wait timeout
vl_client_disconnect:301: peer unresponsive, give up

Step 6 hangs for a couple of minutes and then prints:

vl_map_shmem:639: region init fail
connect_to_vlib_internal:398: vl_client_api map rv -2
Couldn't connect to vpe, exiting…

Are you able to reproduce this?

Thanks!
-Matt

> On Jan 26, 2018, at 4:54 PM, Florin Coras <fcoras.li...@gmail.com> wrote:
> 
> Hi Matt, 
> 
> I tried reproducing this with vpp + vat. Is this a fair equivalent scenario?
> 
> 1. Start vpp and attach vpp_api_test and send some msg
> 2. Restart vpp and stop vat
> 3. Restart vat and send message. 
> 
> The thing is, off of master, this works for me. 
> 
> Thanks, 
> Florin
> 
>> On Jan 26, 2018, at 2:31 PM, Matthew Smith <mgsm...@netgate.com> wrote:
>> 
>> 
>> Hi all,
>> 
>> I have a few applications that use the shared memory API. I’m running these 
>> on CentOS 7.4, and starting VPP using systemd. If VPP happens to crash or be 
>> intentionally restarted, those applications never seem to recover their API 
>> connection. They notice that the original VPP process died and try to call 
>> vl_client_disconnect_from_vlib(). That call tries to send API messages to 
>> cleanly shut down its connection. The application will time out waiting for 
>> a response, write a message like:
>> 
>> vl_client_disconnect:301: peer unresponsive, give up
>> 
>> and eventually consider itself disconnected. When it tries to reconnect, it 
>> hangs for a while (100 seconds the last time I checked) and then 
>> prints messages like:
>> 
>> vl_map_shmem:619: region init fail
>> connect_to_vlib_internal:394: vl_client_api map rv -2
>> 
>> The client keeps on trying and continues seeing those same errors. If the 
>> client is restarted, it sees the same errors after restart. It doesn’t 
>> recover until VPP is restarted with the client stopped. Once that happens, 
>> the client can be started again and successfully connect.
>> 
>> The VPP systemd service file that is installed with RPMs built via ‘make 
>> pkg-rpm’ has the following:
>> 
>> [Service]
>> ExecStartPre=-/bin/rm -f /dev/shm/db /dev/shm/global_vm /dev/shm/vpe-api
>> 
>> When systemd starts VPP, it removes these files, which the still-running 
>> client applications have already mapped with shm_open/mmap. I am guessing 
>> that when those clients try to disconnect with vl_client_disconnect_from_vlib(), 
>> they are stomping on something in shared memory that subsequently keeps them 
>> from being able to connect. If I comment out that command in the systemd 
>> service definition, the problem behavior I described above disappears. The 
>> applications write one ‘peer unresponsive’ message, then reconnect to the 
>> API successfully, and all is (relatively) well. The same is true if I don’t 
>> start VPP with systemd/systemctl and just run /usr/bin/vpp directly.
>> 
>> Does anyone have any thoughts on whether it would be OK to remove that 
>> command from the systemd service file? Or is there a better way for a 
>> client of the shared memory API to deal with VPP crashing?
>> 
>> Thanks!
>> -Matt
>> 
>> _______________________________________________
>> vpp-dev mailing list
>> vpp-dev@lists.fd.io
>> https://lists.fd.io/mailman/listinfo/vpp-dev
> 
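For anyone following the shared-memory theory in the original report: on Linux, glibc places shm_open() objects under /dev/shm, so the `rm -f /dev/shm/...` in ExecStartPre acts like shm_unlink(). Unlinking removes only the name. A process that has already mmap()ed the object keeps its mapping, and a restarted server that creates the same name gets a new, unrelated object. The sketch below is a minimal illustration of that general POSIX behavior, under stated assumptions: it does not use VPP's svm code, and the object name and helpers (DEMO_SHM_NAME, map_region, stale_mapping_demo) are hypothetical. On glibc older than 2.34, link with -lrt.

```c
#include <assert.h>
#include <fcntl.h>      /* O_* constants, per shm_open(3) */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>   /* shm_open, shm_unlink, mmap, munmap */
#include <sys/stat.h>   /* mode constants, per shm_open(3) */
#include <unistd.h>     /* ftruncate, close */

#define DEMO_SHM_NAME "/stale-map-demo" /* hypothetical name, not a VPP region */
#define DEMO_SHM_SIZE 4096

/* Open (optionally creating) the named object, size it, and map it read/write. */
static char *map_region(int create_flags)
{
    int fd = shm_open(DEMO_SHM_NAME, O_RDWR | create_flags, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, DEMO_SHM_SIZE) != 0) {
        close(fd);
        return NULL;
    }
    char *p = mmap(NULL, DEMO_SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical demo: copies what a long-lived client and a restarted server
 * each see into client_out / server_out. Returns 0 on success, -1 on error. */
static int stale_mapping_demo(char *client_out, char *server_out, size_t len)
{
    assert(len >= 4); /* room for "old"/"new" plus the terminating NUL */
    shm_unlink(DEMO_SHM_NAME); /* ignore leftovers from an earlier run */

    /* 1. Original server creates the region; the client maps the same object. */
    char *old_server = map_region(O_CREAT | O_EXCL);
    if (!old_server)
        return -1;
    char *client_view = map_region(0);
    if (!client_view) {
        munmap(old_server, DEMO_SHM_SIZE);
        return -1;
    }
    strcpy(old_server, "old");

    /* 2. "Restart": remove the name, as rm -f /dev/shm/... does. */
    munmap(old_server, DEMO_SHM_SIZE);
    shm_unlink(DEMO_SHM_NAME);

    /* 3. New server creates a brand-new object under the same name. */
    char *server_view = map_region(O_CREAT | O_EXCL);
    if (!server_view) {
        munmap(client_view, DEMO_SHM_SIZE);
        return -1;
    }
    strcpy(server_view, "new");

    /* The client's mapping still refers to the unlinked, orphaned object. */
    snprintf(client_out, len, "%s", client_view);
    snprintf(server_out, len, "%s", server_view);

    munmap(client_view, DEMO_SHM_SIZE);
    munmap(server_view, DEMO_SHM_SIZE);
    shm_unlink(DEMO_SHM_NAME);
    return 0;
}
```

Calling stale_mapping_demo(c, s, sizeof c) with two 8-byte buffers should leave "old" in c and "new" in s: the long-lived client never sees the restarted server's region.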
