Hi Brian,

If you’re adding lots of routes, you’ll need to bump the heap size for the 
IP FIBs as well as the main heap:
  
https://fdio-vpp.readthedocs.io/en/latest/gettingstarted/users/configuring/startup.html#ip
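
As a rough sketch of what that looks like in startup.conf (the sizes below are illustrative assumptions, not recommendations, and the option names should be checked against the linked page for your release):

```
# Main heap (top-level parameter in 18.x-era releases)
heapsize 4G

# IPv4 FIB (mtrie) heap
ip {
  heap-size 4G
}
```

VPP has to be restarted for these settings to take effect.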

to run in gdb:
  sudo service vpp stop (or your OS equivalent)
  make build
  sudo gdb --args ./build-root/install-vpp_debug-native/vpp/bin/vpp -c <YOUR_CONF_FILE> plugin_path <PATH/TO/ALL/PLUGINS>

hope that helps,

/neale




From: <vpp-dev@lists.fd.io> on behalf of Brian Dickson 
<brian.peter.dick...@gmail.com>
Date: Wednesday, 5 December 2018 at 19:31
To: "vpp-dev@lists.fd.io" <vpp-dev@lists.fd.io>
Subject: [vpp-dev] vnet crashes, and problems building debug version (was Re: 
netlink & router (vppsb or patch->vpp) - help building/running)

Greetings again,

Here is more context on the problem I'm seeing.
The problem occurs if a large-ish number of IPv4 prefixes are added to the FIB 
(by way of the netlink and router plugin).

If the prefix count is below some threshold (e.g. 50,000 prefixes), things work 
fine.
At some prefix count (haven't narrowed it down to a specific number, but I 
don't think the actual number is relevant), vnet crashes, in a failure within 
ip4_mtrie.c.

I have been trying to run in debug mode, but am having a lot of difficulty 
building everything with debug.
Basically, the only way I can successfully build everything is to use the 
script vagrant/build.sh (which does a make pkg-rpm that generates a bunch of 
rpm files that I then install with yum).
Then, I have to rebuild things using the instructions from 
vppsb/router/README.md (doing 4 symlinks and various make iterations, and THEN 
having to run some of those with a bunch of CFLAGS values just to get it to 
compile).

I don't see any good/easy way to build debug images from this environment, 
without a LOT of work/investigation on how all the various build components 
work.

Is the problem easy enough to diagnose from a non-symbolic stack dump, or can 
someone provide details on how to build and run vpp with everything to use gdb, 
including the plugins for netlink/router, so the problem can be further 
isolated?

I think there's basically some kind of bug related to the fib stuff in vnet, 
that really needs to be fixed.

The box has an unreasonably large amount of memory (128GB, doing nothing but 
VPP), and I get the same error even if I up the initial heap size by a factor 
of 2^12 (changing 32<<20 to 32ULL<<32).
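
For reference, the arithmetic behind that change can be checked in a shell. This is only a quick sketch: the two shift expressions come from the message above, and the variable names are made up for illustration.

```shell
# Heap sizes before and after the change, in bytes (64-bit shell arithmetic).
old_heap=$(( 32 << 20 ))   # original value: 32 MiB
new_heap=$(( 32 << 32 ))   # modified value (32ULL<<32 in the C source)

echo "old: $(( old_heap >> 20 )) MiB"        # prints "old: 32 MiB"
echo "new: $(( new_heap >> 30 )) GiB"        # prints "new: 128 GiB"
echo "factor: $(( new_heap / old_heap ))"    # prints "factor: 4096" (2^12)
```

Note that the new value works out to 128 GiB, which is about the same as the total memory quoted for the box.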

Please help.

Brian

(In the following, the buffer space message is likely a consequence of the 
thread handling netlink messages dying, rather than a cause.)
Here are the log messages:

Dec  4 17:08:14 sj2tldnslab09 vnet[19785]: dpdk_pool_create:535: ioctl(VFIO_IOMMU_MAP_DMA) pool 'dpdk_mbuf_pool_socket0': Inappropriate ioctl for device (errno 25)
Dec  4 17:08:14 sj2tldnslab09 vnet[19785]: dpdk_ipsec_process:1026: not enough DPDK crypto resources, default to OpenSSL
Dec  4 17:08:16 sj2tldnslab09 vnet[19785]: rtnl_ns_recv:403: Received notification while in sync. Restart synchronization.
Dec  4 17:08:16 sj2tldnslab09 vnet[19785]: rtnl_process_read:467: rtnetlink recv error (31) []: Bad file descriptor
Dec  4 17:08:58 sj2tldnslab09 vnet[19785]: rtnl_process_read:467: rtnetlink recv error (27) []: No buffer space available
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: rtnl_process_read:467: rtnetlink recv error (27) []: No buffer space available
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: received signal SIGABRT, PC 0x7f043c3c7277
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #0  0x00007f043e5c18c5 0x7f043e5c18c5
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #1  0x00007f043c9716d0 0x7f043c9716d0
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #2  0x00007f043c3c7277 gsignal + 0x37
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #3  0x00007f043c3c8968 abort + 0x148
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #4  0x00005569eb7900d3 0x5569eb7900d3
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #5  0x00007f043d0e8512 vec_resize_allocate_memory + 0x2f2
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #6  0x00007f043dd9809f 0x7f043dd9809f
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #7  0x00007f043dd985cd ip4_fib_mtrie_route_add + 0x17d
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #8  0x00007f043e129b08 fib_entry_src_action_install + 0xb8
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #9  0x00007f043e1274a0 fib_entry_create + 0x70
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #10 0x00007f043e11e890 fib_table_entry_path_add2 + 0x190
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #11 0x00007f03f86833fd add_del_route + 0x34c
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #12 0x00007f03f8683594 netns_notify_cb + 0x8c
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #13 0x00007f03f8466e71 netns_notify + 0x1f3
Dec  4 17:09:07 sj2tldnslab09 vnet[19785]: #14 0x00007f03f84684ed ns_rcv_route + 0x825

On Tue, Nov 27, 2018 at 6:17 PM Brian Dickson 
<brian.peter.dick...@gmail.com> wrote:
I have been working with the netlink and router plugins, which I was able to 
build from the 18.07 tree via the instructions in vppsb/router.

(NB: trying to build from anything more recent, e.g. 18.10 or 19.01 breaks, 
with no obvious easy resolution).

When running with these plugins, connected with an open source router (bird 
version 1.6.4 or 2.02) and with a very small routing table, it works really 
really well.

(I was able to run roughly line-rate 10g even with small packets, and when 
using a second host with vpp and the span->pg->pcap to /tmp, didn't lose any 
data.)

However, when trying to load up the routing table, things went sideways, and it 
seems to be something netlink-related. (This was using BGP to feed in 3 copies 
of the full routing table, each copy of which is about 750K routes.)

I was hoping someone could provide good instructions (good == tested and works) 
on building from a more recent release of VPP to see if it's an issue that has 
been fixed.

If the issue persists and/or looks to be netlink-specific, would anyone be able 
to look into it? I'm happy to provide logs etc.

The system is bare-metal CentOS 7.5, with tons of cores, memory, etc.

The first few messages in syslog look like:

Nov 27 17:57:30 sj2tldnslab09 bird: Kernel dropped some netlink messages, will resync on next scan.
Nov 27 17:57:30 sj2tldnslab09 vnet[127960]: rtnl_process_read:467: rtnetlink recv error (27) []: No buffer space available
Nov 27 17:57:30 sj2tldnslab09 vnet[127960]: rtnl_process_read:467: rtnetlink recv error (27) []: No buffer space available



After a bunch of similar groups of messages, VPP appears to crash.



If this is a known problem or if there's something that needs to be tweaked on 
the host, any assistance would be greatly appreciated.



Brian
View/Reply Online (#11502): https://lists.fd.io/g/vpp-dev/message/11502