Re: [vpp-dev] VPP core dump with PPPoE
Hi Raj,

Thanks for spending the time for further investigation. Some comments inline.

/neale

On 18/10/2019 11:27, "vpp-dev@lists.fd.io on behalf of Raj" wrote:

    On Thu, Oct 17, 2019 at 8:45 PM Neale Ranns (nranns) wrote:
    > I think your analysis is spot on. Thank you!
    > Now you're a VPP PPPoE expert - do you have some suggestions on a fix?

    Thanks for taking a look at this Neale! I have some ideas for a fix to
    this problem.

    Alternative 1. At packet processing time, get the session information
    using adj->rewrite_header.sw_if_index. With this approach fixup_data need
    not be passed:

    + pppoe_main_t *pem = &pppoe_main;
    +
    + u32 sw_if_index = adj->rewrite_header.sw_if_index;
    + u32 session_id = pem->session_index_by_sw_if_index[sw_if_index];
    + t = pool_elt_at_index (pem->sessions, session_id);

    Pros: we arrive at a solution without any function prototype change.
    Cons: we have to look up the session data while processing every PPPoE
    packet.

As you say, this is not very D-cache friendly...

    Alternative 2. The PPPoE rewrite is built from the rewrite data
    (adj->rewrite_header), which carries a sw_if_index
    (adj->rewrite_header.sw_if_index). How is this different from the
    session's encap_if_index? I tried to understand this from the code, but
    couldn't figure out whether the two are the same. If they are, the fix
    would be: when processing a PPPoE packet, just use
    adj->rewrite_header.sw_if_index.

IIUC adj->rewrite_header.sw_if_index is the index of the PPPoE sub-interface; the action of the fixup is to swap the TX interface to the parent Ethernet - which is (session)->encap_if_index.

    Alternative 3. As Neale suggested, just push the session's encap_if_index
    instead of passing the const void *. But this requires some architectural
    changes, I believe, because const void *fixup_data depends on the function
    pointer adj->sub_type.midchain.fixup_func, which can only be bound at run
    time.
The pointer passed as fixup_data must remain valid for the lifetime of the adjacency (which is the lifetime of the session). That's why I originally used the pointer to the session but, as you discovered, it reallocs. I would be tempted to malloc the fixup data and attach it to the session; that way, when the session pool itself reallocs, the fixup data is unaffected.

    I have one question regarding pool memory creation: on each PPPoE session
    addition, VPP is resizing the vector pool pem->sessions. Is it OK if I
    allocate a fixed number of sessions initially? I am trying to understand
    the architectural rationale behind adding one element to the vector pool
    at a time, as opposed to allocating a bulk number initially. If it is OK
    to allocate a large pool initially (preferably taking max_sessions from a
    conf file), we can side-step this issue.

Pools don't resize with each allocation; instead, they grow by 3/2 times the current size each time they fill - this is managed by the underlying vector. You can alloc a large number up front, but then you have to play a balancing game of how many you need versus how much memory you waste. Plus, if we don't solve this problem, the pool can never grow and one would be limited to a fixed number of PPPoE sessions - which is not great.

/neale

    Thanks and Regards,

    Raj

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#14234): https://lists.fd.io/g/vpp-dev/message/14234
Mute This Topic: https://lists.fd.io/mt/34298895/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-
Re: [vpp-dev] VPP core dump with PPPoE
On Thu, Oct 17, 2019 at 8:45 PM Neale Ranns (nranns) wrote:

> I think your analysis is spot on. Thank you!
> Now you're a VPP PPPoE expert - do you have some suggestions on a fix?

Thanks for taking a look at this Neale! I have some ideas for a fix to this problem.

Alternative 1. At packet processing time, get the session information using adj->rewrite_header.sw_if_index. With this approach fixup_data need not be passed:

    + pppoe_main_t *pem = &pppoe_main;
    +
    + u32 sw_if_index = adj->rewrite_header.sw_if_index;
    + u32 session_id = pem->session_index_by_sw_if_index[sw_if_index];
    + t = pool_elt_at_index (pem->sessions, session_id);

Pros: we arrive at a solution without any function prototype change.
Cons: we have to look up the session data while processing every PPPoE packet.

Alternative 2. The PPPoE rewrite is built from the rewrite data (adj->rewrite_header), which carries a sw_if_index (adj->rewrite_header.sw_if_index). How is this different from the session's encap_if_index? I tried to understand this from the code, but couldn't figure out whether the two are the same. If they are, the fix would be: when processing a PPPoE packet, just use adj->rewrite_header.sw_if_index.

Alternative 3. As Neale suggested, just push the session's encap_if_index instead of passing the const void *. But this requires some architectural changes, I believe, because const void *fixup_data depends on the function pointer adj->sub_type.midchain.fixup_func, which can only be bound at run time.

I have one question regarding pool memory creation: on each PPPoE session addition, VPP is resizing the vector pool pem->sessions. Is it OK if I allocate a fixed number of sessions initially? I am trying to understand the architectural rationale behind adding one element to the vector pool at a time, as opposed to allocating a bulk number initially.
If it is OK to allocate a large pool initially (preferably taking max_sessions from a conf file), we can side-step this issue.

Thanks and Regards,

Raj
Re: [vpp-dev] VPP core dump with PPPoE
That will work. Looking into the fixup function, all that's used from the session is the egress interface, so only that needs to be 'pushed'. I suspect using the whole session was a convenience for the original author, whoever that might have been...

/neale

On 17/10/2019 18:38, "Benoit Ganne (bganne)" wrote:

    > I think your analysis is spot on. Thank you!
    > Now you're a VPP PPPoE expert - do you have some suggestions on a fix?

    Shouldn't the usual solution of storing the index in the pool instead of
    the pointer itself be enough? I.e., use session_id as fixup_data, and use
    it in pppoe_fixup() to retrieve the session from the pool.

    ben
Re: [vpp-dev] VPP core dump with PPPoE
> I think your analysis is spot on. Thank you!
> Now you're a VPP PPPoE expert - do you have some suggestions on a fix?

Shouldn't the usual solution of storing the index in the pool instead of the pointer itself be enough? I.e., use session_id as fixup_data, and use it in pppoe_fixup() to retrieve the session from the pool.

ben
Re: [vpp-dev] VPP core dump with PPPoE
Hi Raj,

I think your analysis is spot on. Thank you!
Now you're a VPP PPPoE expert - do you have some suggestions on a fix?

/neale
Re: [vpp-dev] VPP core dump with PPPoE
Hello all,

I could do some more analysis of this issue and I think I have identified what could be wrong here.

During PPPoE session creation, as we mentioned earlier, the below function will get invoked from pppoe_update_adj():

    adj_nbr_midchain_update_rewrite (adj_index_t adj_index,
                                     adj_midchain_fixup_t fixup,
                                     const void *fixup_data, ...)

Note the 3rd arg, 'fixup_data': here we are passing the current session address as fixup_data.

When subsequent sessions are added, this memory address will get altered as we are resizing the vector pool pem->sessions. Hence the stored fixup_data (adj->sub_type.midchain.fixup_data) address is no longer valid, and that address could have been already freed.

I think that is the reason why we see memory corruption. It would be great if someone with far better knowledge of VPP internals could take a look at this and confirm it.

Thanks and Regards,

Raj
Re: [vpp-dev] VPP core dump with PPPoE
Hi Raj,

I tried to reproduce your issue on VPP 20.01 as per your steps several times, but cannot reproduce it.

From your description, please set a breakpoint in vnet_pppoe_add_del_session() and see what happens when you create your second PPPoE session while traffic is flowing in the first PPPoE session.

Thanks,
Hongjun
Re: [vpp-dev] VPP core dump with PPPoE
Hello all,

I did some more debugging to find out when and where exactly the pppoe_session_t gets corrupted. I added a couple of log entries, as shown below, to log the PPPoE session id when a session is created as well as when packets from north traverse to south. I have tried this in VPP 19.08, 19.04 and 19.01 with the same results.

    vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 20923
    vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 35666
    vnet [21892]: pppoe_fixup:169: New_Packet pppoe01 session id 35666
    vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 58191

The sequence when corruption happens seems to be:

1. A new session is created
2. A packet for the newly created session traverses from north to south
3. The next session is created - and VPP crashes.

Digging deeper, I added a watch for all newly created sessions using the following gdb script:

    b /root/build-1901/src/vnet/ip/ip4_forward.c:2444
    commands 1
    watch -l ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
    bt
    continue
    end

gdb, running with this script, bails out with the following message:

    Thread 1 "vpp_main" hit Hardware watchpoint 2: -location ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id

    Old value = 35666
    New value = 4883
    __memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
    203 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such file or directory.
    (gdb)

It is interesting to note that 4883 is 0x1313.

Back trace shows the path it took to reach here:

    (gdb) bt
    #0 __memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
    #1 0x761a4179 in mspace_put (msp=0x7fffb4df7010, p_arg=0x7fffb592d9c8) at /root/build-1901/src/vppinfra/dlmalloc.c:4294
    #2 0x7618ea39 in clib_mem_free (p=0x7fffb592d9c8) at /root/build-1901/src/vppinfra/mem.h:215
    #3 0x7618edd8 in vec_resize_allocate_memory (v=0x7fffb592da00, length_increment=1, data_bytes=312, header_bytes=56, data_align=64) at /root/build-1901/src/vppinfra/vec.c:96
    #4 0x7fffb0aa4a29 in _vec_resize_inline (v=0x7fffb592da00, length_increment=1, data_bytes=256, header_bytes=48, data_align=64) at /root/build-1901/src/vppinfra/vec.h:147
    #5 0x7fffb0aa9ca4 in vnet_pppoe_add_del_session (a=0x7fffb6703950, sw_if_indexp=0x7fffb67038e8) at /root/build-1901/src/plugins/pppoe/pppoe.c:335
    #6 0x7fffb0aaadec in pppoe_add_del_session_command_fn (vm=0x768e3400 , input=0x7fffb6703ee0, cmd=0x7fffb65dd73c) at /root/build-1901/src/plugins/pppoe/pppoe.c:554
    #7 0x76617db0 in vlib_cli_dispatch_sub_commands (vm=0x768e3400 , cm=0x768e3600 , input=0x7fffb6703ee0, parent_command_index=21) at /root/build-1901/src/vlib/cli.c:644

This does not occur if traffic is not initiated and there is no packet flow through the system. It would be great if someone who understands this code could confirm whether my analysis is correct and give some pointers to figure out:

1. Why, when a new session is created, the data of an old session is changed to 0x1313
2. What debugging steps I can take next to figure out why this happens.

Thanks and Regards,

Raj
Re: [vpp-dev] VPP core dump with PPPoE
Hello all,

I have done some more tests to pinpoint the exact condition of the crash. What I could figure out was that the crash happens when memory is being allocated for pppoe_session_t while packets are flowing through the pppoe interface.

Here is what I did to arrive at this conclusion:

1. Configure VPP without any default route (to ensure packets do not hit the north interface from the south)
2. Provision 100 PPPoE clients - No crash observed
3. Deprovision all 100 PPPoE clients
4. Configure default route
5. Provision 100 PPPoE clients again, and start a ping to an external IP from each client - No crash observed
6. Provision 50 more PPPoE clients - VPP crashes.

Based on this test, and from what I could understand from the code, my guess is that there is some memory corruption happening inside pppoe_session_t when memory is being allocated for it while packets are traversing the PPPoE interface.

Thanks and Regards,

Raj

On Thu, Sep 26, 2019 at 7:15 PM Raj via Lists.Fd.Io wrote:
>
> Hello all,
>
> I am observing a VPP crash when approximately 20 - 50 PPPoE clients
> are connecting and traffic is flowing through them. This crash was
> reproducible every time I tried.
>
> I did some debugging and here is what I could find out so far:
>
> If I understand correctly, when an incoming packet from the north side
> is being sent to a PPPoE interface, pppoe_fixup() is called to update
> pppoe0->length and t->encap_if_index. Length and encap_if_index are
> taken from adj0->sub_type.midchain.fixup_data.
>
> My observation is that while clients are connecting and traffic is
> flowing for connected clients, adj0->sub_type.midchain.fixup_data
> appears to hold incorrect data at some point in time during the
> test. What we have seen is that the incorrect data
> (adj0->sub_type.midchain.fixup_data) is observed for clients which
> have been provisioned for some time and which had packets flowing
> through them.
>
> I figured this out by using gdb and inspecting
> adj0->sub_type.midchain.fixup_data, after typecasting it to
> pppoe_session_t.
>
> In the structure, I could see that session_id, client_ip and encap_idx
> are incorrect. I did not check other values in the structure.
>
> I also added code to log these fields in pppoe_fixup(), and the logs
> too show incorrect data in the fields.
>
> Example logs taken just before the crash:
>
> vnet[12988]: pppoe_fixup:243: 40:7b:1b: 0:12:38 -> 2:42: a: 1: 0: 2 , type 8864
> vnet[12988]: pppoe_fixup:271: pppoe session id 4883, client_ip 0x13131313 encap idx 0x13131313
>
> The first log prints out the packet headers, to verify that the data
> in the packet is as expected and correct. The second log prints values
> from the pppoe_session data, and it can be seen that the values are
> obviously incorrect. At this point the packet is sent out through the
> south interface. Again, after some time the TX index values become
> something like 1422457436 and VPP core dumps.
>
> We have tested the following scenarios:
>
> 1. Add PPPoE clients without sending out any traffic: No crash observed.
> 2. Add n PPPoE clients, then load traffic [no adding or removal of
>    clients while traffic is on, see next scenario]: No crash observed.
> 3. Load traffic as soon as each client connects: VPP crash observed.
>
> Another observation is that encap_if_index is available in two places
> inside pppoe_fixup:
>
> 1. adj->rewrite_header.sw_if_index
> 2. t->encap_if_index
>
> t->encap_if_index is used for updating TX, and this gets corrupted,
> while adj->rewrite_header.sw_if_index has the correct index.
>
> I can check and get back if you need any additional information. Let
> me know if a bug report should be created for this.
>
> Environment:
>
> vpp# show version verbose
> Version: v19.08.1-59~ga2aa83ca9-dirty
> Compiled by: root
> Compile host: build-02
> Compile date: Thu Sep 26 16:44:00 IST 2019
> Compile location: /root/build-1908
> Compiler: GCC 7.4.0
> Current PID: 7802
>
> Operating system: Ubuntu 18.04 amd64
>
> startup.conf and the associated exec file are attached.
>
> There is a small patch to stock VPP to disable
> ETHERNET_ERROR_L3_MAC_MISMATCH, which is attached. I have also
> attached the output of show hardware and the gdb bt output. I have the
> core file and its matching VPP debs, which can be shared if needed.
>
> In the bt the incorrect value of index can be seen in frame #5:
>
> #5 0x7fba88e9ce0b in vlib_increment_combined_counter
> (n_bytes=, n_packets=1, index=538976288,
> thread_index=0, cm=0x7fba481f46a0) at
> /root/build-1908/src/vlib/counter.h:229
>
> Thanks and Regards,
>
> Raj
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
>
> View/Reply Online (#14063): https://lists.fd.io/g/vpp-dev/message/14063
> Mute This Topic: https://lists.fd.io/mt/34298895/157026
> Group Owner: vpp-dev+ow...@lists.fd.io
> Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [rajlistu...@gmail.com]
>
[vpp-dev] VPP core dump with PPPoE
Hello all,

I am observing a VPP crash when approximately 20 - 50 PPPoE clients are connecting and traffic is flowing through them. This crash was reproducible every time I tried.

I did some debugging and here is what I could find out so far:

If I understand correctly, when an incoming packet from the north side is being sent to a PPPoE interface, pppoe_fixup() is called to update pppoe0->length and t->encap_if_index. Length and encap_if_index are taken from adj0->sub_type.midchain.fixup_data.

My observation is that while clients are connecting and traffic is flowing for connected clients, adj0->sub_type.midchain.fixup_data appears to hold incorrect data at some point in time during the test. What we have seen is that the incorrect data (adj0->sub_type.midchain.fixup_data) is observed for clients which have been provisioned for some time and which had packets flowing through them.

I figured this out by using gdb and inspecting adj0->sub_type.midchain.fixup_data, after typecasting it to pppoe_session_t.

In the structure, I could see that session_id, client_ip and encap_idx are incorrect. I did not check other values in the structure.

I also added code to log these fields in pppoe_fixup(), and the logs too show incorrect data in the fields.

Example logs taken just before the crash:

vnet[12988]: pppoe_fixup:243: 40:7b:1b: 0:12:38 -> 2:42: a: 1: 0: 2 , type 8864
vnet[12988]: pppoe_fixup:271: pppoe session id 4883, client_ip 0x13131313 encap idx 0x13131313

The first log prints out the packet headers, to verify that the data in the packet is as expected and correct. The second log prints values from the pppoe_session data, and it can be seen that the values are obviously incorrect. At this point the packet is sent out through the south interface. Again, after some time the TX index values become something like 1422457436 and VPP core dumps.

We have tested the following scenarios:

1. Add PPPoE clients without sending out any traffic: No crash observed.
2. Add n PPPoE clients, then load traffic [no adding or removal of clients while traffic is on, see next scenario]: No crash observed.
3. Load traffic as soon as each client connects: VPP crash observed.

Another observation is that encap_if_index is available in two places inside pppoe_fixup:

1. adj->rewrite_header.sw_if_index
2. t->encap_if_index

t->encap_if_index is used for updating TX, and this gets corrupted, while adj->rewrite_header.sw_if_index has the correct index.

I can check and get back if you need any additional information. Let me know if a bug report should be created for this.

Environment:

vpp# show version verbose
Version: v19.08.1-59~ga2aa83ca9-dirty
Compiled by: root
Compile host: build-02
Compile date: Thu Sep 26 16:44:00 IST 2019
Compile location: /root/build-1908
Compiler: GCC 7.4.0
Current PID: 7802

Operating system: Ubuntu 18.04 amd64

startup.conf and the associated exec file are attached.

There is a small patch to stock VPP to disable ETHERNET_ERROR_L3_MAC_MISMATCH, which is attached. I have also attached the output of show hardware and the gdb bt output. I have the core file and its matching VPP debs, which can be shared if needed.
In the bt the incorrect value of index can be seen in frame #5:

#5 0x7fba88e9ce0b in vlib_increment_combined_counter (n_bytes=, n_packets=1, index=538976288, thread_index=0, cm=0x7fba481f46a0) at /root/build-1908/src/vlib/counter.h:229

Thanks and Regards,

Raj

vpp# show version verbose
Version: v19.08.1-59~ga2aa83ca9-dirty
Compiled by: root
Compile host: build-02
Compile date: Thu Sep 26 16:44:00 IST 2019
Compile location: /root/build-1908
Compiler: GCC 7.4.0
Current PID: 7802

vpp# show hardware
              Name                Idx   Link  Hardware
GigabitEthernet3/0/1               1     up   GigabitEthernet3/0/1
  Link speed: 1 Gbps
  Ethernet address 40:7b:1b:00:12:33
  Intel e1000
    carrier up full duplex mtu 9206
    flags: pmd maybe-multiseg tx-offload intel-phdr-cksum rx-ip4-cksum
    rx: queues 1 (max 8), desc 1024 (min 32 max 4096 align 8)
    tx: queues 1 (max 8), desc 1024 (min 32 max 4096 align 8)
    pci: device 8086:1521 subsystem 15bb:0008 address :03:00.01 numa 0
    max rx packet len: 16383
    promiscuous: unicast off all-multicast off
    vlan offload: strip off filter off qinq off
    rx offload avail: vlan-strip ipv4-cksum udp-cksum tcp-cksum vlan-filter
                      jumbo-frame scatter keep-crc
    rx offload active: ipv4-cksum jumbo-frame scatter
    tx offload avail: vlan-insert ipv4-cksum udp-cksum tcp-cksum sctp-cksum
                      tcp-tso multi-segs
    tx offload active: udp-cksum tcp-cksum multi-segs
    rss avail: ipv4-tcp ipv4-udp ipv4 ipv6-tcp-ex ipv6-udp-ex ipv6-tcp
               ipv6-udp ipv6-ex ipv6
    rss active: none
    tx burst function: eth_igb_xmit_pkts
    rx burst