Hi Raj,

I think your analysis is spot on. Thank you!
Now that you're a VPP PPPoE expert, do you have some suggestions for a fix?

/neale

On 17/10/2019 16:18, "vpp-dev@lists.fd.io on behalf of Raj" 
<vpp-dev@lists.fd.io on behalf of rajlistu...@gmail.com> wrote:

    Hello all,
    
    I have done some more analysis of this issue, and I think I have
    identified what is wrong here.
    
    During PPPoE session creation, as mentioned earlier, the function
    below is invoked from pppoe_update_adj():
    
    adj_nbr_midchain_update_rewrite (adj_index_t adj_index,
                                     adj_midchain_fixup_t fixup,
                                     const void *fixup_data, ...)
    
    Note the third argument, 'fixup_data': here we pass the address of
    the current session as fixup_data.
    
    When subsequent sessions are added, this memory address can change,
    because adding a session resizes the pem->sessions vector pool and
    the pool may be reallocated. The stored fixup_data pointer
    (adj->sub_type.midchain.fixup_data) is then no longer valid, and the
    memory it points to may already have been freed.
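
    To see the failure mode outside VPP, here is a minimal standalone C
    analogy (not VPP code; the realloc below stands in for the
    pem->sessions resize):

        #include <stdio.h>
        #include <stdlib.h>

        /* growable array of sessions, standing in for pem->sessions */
        typedef struct { unsigned short session_id; } session_t;

        int
        main (void)
        {
          size_t n = 1;
          session_t *sessions = malloc (n * sizeof (session_t));
          sessions[0].session_id = 35666;

          /* like fixup_data: a raw pointer to a pool element */
          void *stored = sessions;

          /* "adding more sessions": the array grows, realloc may move
             it, and the old block is freed (and can be scrubbed) */
          n = 100000;
          sessions = realloc (sessions, n * sizeof (session_t));

          if ((void *) sessions != stored)
            printf ("pool moved from %p to %p; stored pointer is stale\n",
                    stored, (void *) sessions);

          free (sessions);
          return 0;
        }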
    
    I think that is the reason we see the memory corruption. It would be
    great if someone with far better knowledge of VPP internals could
    take a look at this and confirm it.
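
    If this analysis is right, one possible fix is to stop storing the
    pointer and instead pass something that stays valid across pool
    resizes, e.g. the session index, re-resolving it inside the fixup.
    A sketch only (not compiled; names follow pppoe.c and the adjacency
    API quoted above, details assumed):

        /* at session-create time, in pppoe_update_adj(): pass the
           session *index* as fixup_data */
        adj_nbr_midchain_update_rewrite (adj_index, pppoe_fixup,
                                         uword_to_pointer (t - pem->sessions,
                                                           const void *),
                                         ... /* remaining args as today */);

        /* per packet: look the session up again, so a reallocated
           pem->sessions pool can no longer leave a stale pointer */
        static void
        pppoe_fixup (vlib_main_t *vm, ip_adjacency_t *adj,
                     vlib_buffer_t *b0, const void *data)
        {
          pppoe_main_t *pem = &pppoe_main;
          pppoe_session_t *t =
            pool_elt_at_index (pem->sessions, pointer_to_uword (data));
          /* use t->session_id, t->encap_if_index etc. as before */
        }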
    
    Thanks and Regards,
    
    Raj
    
    On Fri, Oct 11, 2019 at 12:17 PM Ni, Hongjun <hongjun...@intel.com> wrote:
    >
    > Hi Raj,
    >
    > I tried to reproduce your issue on VPP 20.01, following your steps
    > several times, but could not reproduce it.
    >
    > Based on your description, please set a breakpoint in
    > vnet_pppoe_add_del_session() and see what happens when you create
    > the second PPPoE session while traffic is flowing on the first one.
    >
    > Thanks,
    > Hongjun
    >
    > -----Original Message-----
    > From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of Raj
    > Sent: Monday, September 30, 2019 11:54 PM
    > To: vpp-dev <vpp-dev@lists.fd.io>
    > Subject: Re: [vpp-dev] VPP core dump with PPPoE
    >
    > Hello all,
    >
    > I did some more debugging to find out when and where exactly the
    > pppoe_session_t gets corrupted. I added a couple of log entries, as
    > shown below, to log the PPPoE session id when a session is created
    > and when packets traverse from north to south. I have tried this on
    > VPP 19.08, 19.04 and 19.01 with the same results.
    >
    > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 20923
    > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 35666
    > vnet [21892]: pppoe_fixup:169:      New_Packet  pppoe01 session id 35666
    > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 58191
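    >
    > (For reference, the instrumentation was nothing more than a log line
    > at each of those two spots; a hedged reconstruction, where the exact
    > statement is assumed and 'ifname'/'t' are placeholders:)
    >
    >     /* clib_warning() prefixes the message with function:line,
    >        matching the "pppoe_update_adj:195:" form in the output above */
    >     clib_warning ("New_Session %s session id %d", ifname, t->session_id);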
    >
    > The sequence when the corruption happens seems to be:
    >
    > 1. A new session is created.
    > 2. A packet for the newly created session traverses from north to south.
    > 3. The next session is created, and VPP crashes.
    >
    > Digging deeper, I added a watch for all newly created sessions using
    > the following gdb script:
    >
    > b /root/build-1901/src/vnet/ip/ip4_forward.c:2444
    > commands 1
    >  watch -l ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
    >  bt
    >  continue
    > end
    >
    > gdb, running with this script, bails out with the following message:
    >
    > Thread 1 "vpp_main" hit Hardware watchpoint 2: -location ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
    >
    > Old value = 35666
    > New value = 4883
    > __memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
    > 203     ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such file or directory.
    > (gdb)
    >
    > It is interesting to note that 4883 is 0x1313, i.e. the old session's
    > memory is now filled with the repeating byte 0x13. The backtrace
    > below shows where that fill comes from: a memset() on the allocator's
    > free path (mspace_put() via clib_mem_free()), triggered from
    > vec_resize_allocate_memory() as the sessions vector is resized.
    >
    > The backtrace shows the path taken to get here:
    >
    >
    > (gdb) bt
    > #0  __memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
    > #1  0x00007ffff61a4179 in mspace_put (msp=0x7fffb4df7010, p_arg=0x7fffb592d9c8) at /root/build-1901/src/vppinfra/dlmalloc.c:4294
    > #2  0x00007ffff618ea39 in clib_mem_free (p=0x7fffb592d9c8) at /root/build-1901/src/vppinfra/mem.h:215
    > #3  0x00007ffff618edd8 in vec_resize_allocate_memory (v=0x7fffb592da00, length_increment=1, data_bytes=312, header_bytes=56, data_align=64) at /root/build-1901/src/vppinfra/vec.c:96
    > #4  0x00007fffb0aa4a29 in _vec_resize_inline (v=0x7fffb592da00, length_increment=1, data_bytes=256, header_bytes=48, data_align=64) at /root/build-1901/src/vppinfra/vec.h:147
    > #5  0x00007fffb0aa9ca4 in vnet_pppoe_add_del_session (a=0x7fffb6703950, sw_if_indexp=0x7fffb67038e8) at /root/build-1901/src/plugins/pppoe/pppoe.c:335
    > #6  0x00007fffb0aaadec in pppoe_add_del_session_command_fn (vm=0x7ffff68e3400 <vlib_global_main>, input=0x7fffb6703ee0, cmd=0x7fffb65dd73c) at /root/build-1901/src/plugins/pppoe/pppoe.c:554
    > #7  0x00007ffff6617db0 in vlib_cli_dispatch_sub_commands (vm=0x7ffff68e3400 <vlib_global_main>, cm=0x7ffff68e3600 <vlib_global_main+512>, input=0x7fffb6703ee0, parent_command_index=21) at /root/build-1901/src/vlib/cli.c:644
    >
    > This does not occur if traffic is not initiated and no packets flow
    > through the system. It would be great if someone who understands this
    > code could confirm whether my analysis is correct and give some
    > pointers on:
    >
    > 1. Why, when a new session is created, the data of an old session is
    >    changed to 0x1313.
    > 2. What debugging steps I can take next to figure out why this happens.
    >
    >
    >
    > Thanks and Regards,
    >
    >
    > Raj
    >
    > On Sat, Sep 28, 2019 at 6:09 PM Raj via Lists.Fd.Io 
<rajlistuser=gmail....@lists.fd.io> wrote:
    > >
    > > Hello all,
    > >
    > > I have done some more tests to pinpoint the exact condition of the
    > > crash. What I could figure out was that the crash happens when memory
    > > is being allocated for pppoe_session_t while packets are flowing
    > > through the PPPoE interface.
    > >
    > > Here is what I did to arrive at this conclusion:
    > >
    > > 1. Configure VPP without any default route (to ensure packets do
    > >    not hit the north interface from the south).
    > > 2. Provision 100 PPPoE clients - no crash observed.
    > > 3. Deprovision all 100 PPPoE clients.
    > > 4. Configure the default route.
    > > 5. Provision 100 PPPoE clients again, and start a ping to an
    > >    external IP from each client - no crash observed.
    > > 6. Provision 50 more PPPoE clients - VPP crashes.
    > >
    > > Based on this test, and from what I could understand from the code,
    > > my guess is that there is some memory corruption happening inside
    > > pppoe_session_t when memory is being allocated for it while packets
    > > are traversing the PPPoE interface.
    > >
    > > Thanks and Regards,
    > >
    > > Raj
    > >
    > >
    > > On Thu, Sep 26, 2019 at 7:15 PM Raj via Lists.Fd.Io
    > > <rajlistuser=gmail....@lists.fd.io> wrote:
    > > >
    > > > Hello all,
    > > >
    > > > I am observing a VPP crash when approximately 20 - 50 PPPoE clients
    > > > are connecting and traffic is flowing through them. This crash was
    > > > reproducible every time I tried.
    > > >
    > > > I did some debugging and here is what I could find out so far:
    > > >
    > > > If I understand correctly, when an incoming packet from the north
    > > > side is being sent to a PPPoE interface, pppoe_fixup() is called
    > > > to update pppoe0->length and t->encap_if_index. The length and
    > > > encap_if_index are taken from adj0->sub_type.midchain.fixup_data.
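    > > >
    > > > (Roughly, the shape of that fixup; a sketch with details assumed
    > > > rather than copied from pppoe.c:)
    > > >
    > > >     static void
    > > >     pppoe_fixup (vlib_main_t *vm, ip_adjacency_t *adj,
    > > >                  vlib_buffer_t *b0, const void *data)
    > > >     {
    > > >       /* 'data' is the fixup_data pointer stored at session-create
    > > >          time; if the sessions pool has moved, it is stale */
    > > >       const pppoe_session_t *t = data;
    > > >       pppoe_header_t *pppoe0 = vlib_buffer_get_current (b0);
    > > >
    > > >       /* exact length computation assumed */
    > > >       pppoe0->length =
    > > >         clib_host_to_net_u16 (vlib_buffer_length_in_chain (vm, b0)
    > > >                               - sizeof (pppoe_header_t));
    > > >       /* TX counters are then updated against t->encap_if_index,
    > > >          which is where stale session memory shows up as a bogus
    > > >          interface index */
    > > >     }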
    > > >
    > > > My observation is that, while clients are connecting and traffic
    > > > is flowing for already-connected clients,
    > > > adj0->sub_type.midchain.fixup_data comes to hold incorrect data at
    > > > some point during the test. The incorrect data is observed for
    > > > clients which have been provisioned for some time and which have
    > > > had packets flowing through them.
    > > >
    > > > I figured this out by using gdb and inspecting
    > > > adj0->sub_type.midchain.fixup_data after casting it to
    > > > pppoe_session_t *.
    > > >
    > > > In the structure, I could see that session_id, client_ip and
    > > > encap_idx are incorrect. I did not check other values in the
    > > > structure.
    > > >
    > > > I also added code to log these fields in pppoe_fixup(), and the
    > > > logs too show incorrect data in the fields.
    > > >
    > > > Example logs taken just before crash:
    > > >
    > > > vnet[12988]: pppoe_fixup:243: 40:7b:1b: 0:12:38 ->  2:42: a: 1: 0: 2
    > > > , type 8864
    > > > vnet[12988]: pppoe_fixup:271: pppoe session id 4883, client_ip
    > > > 0x13131313 encap idx 0x13131313
    > > >
    > > > The first log line prints the packet headers, to verify that the
    > > > data in the packet is as expected and correct. The second prints
    > > > values from the pppoe_session data, and it can be seen that the
    > > > values are obviously incorrect. At this point the packet is sent
    > > > out through the south interface. After some more time the TX index
    > > > values become something like 1422457436 and VPP core dumps.
    > > >
    > > > We have tested the following scenarios:
    > > >
    > > > 1. Add PPPoE clients without sending out any traffic: no crash
    > > >    observed.
    > > > 2. Add n PPPoE clients, then load traffic (no adding or removal of
    > > >    clients while traffic is on; see the next scenario): no crash
    > > >    observed.
    > > > 3. Load traffic as soon as each client connects: VPP crash observed.
    > > >
    > > > Another observation is that encap_if_index is available in two
    > > > places inside pppoe_fixup():
    > > >
    > > > 1. adj->rewrite_header.sw_if_index
    > > > 2. t->encap_if_index
    > > >
    > > > t->encap_if_index is used for updating TX, and this gets corrupted,
    > > > while adj->rewrite_header.sw_if_index has the correct index.
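    > > >
    > > > (If the stored session pointer is what goes stale, a sketch of a
    > > > counter-update workaround, under the same assumptions as above:)
    > > >
    > > >     /* the adjacency itself carries the encap interface index and
    > > >        does not depend on the pem->sessions element staying put */
    > > >     u32 sw_if_index = adj->rewrite_header.sw_if_index;
    > > >     /* use sw_if_index instead of t->encap_if_index when bumping
    > > >        the TX counters in pppoe_fixup() */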
    > > >
    > > > I can check and get back if you need any additional information. Let
    > > > me know if a bug report is to be created for this.
    > > >
    > > > Environment:
    > > >
    > > > vpp# show version verbose
    > > > Version:          v19.08.1-59~ga2aa83ca9-dirty
    > > > Compiled by:          root
    > > > Compile host:          build-02
    > > > Compile date:          Thu Sep 26 16:44:00 IST 2019
    > > > Compile location:      /root/build-1908
    > > > Compiler:          GCC 7.4.0
    > > > Current PID:          7802
    > > >
    > > > Operating system: Ubuntu 18.04 amd64
    > > >
    > > > startup.conf and associated exec file is attached.
    > > >
    > > > There is a small patch to stock VPP to disable
    > > > ETHERNET_ERROR_L3_MAC_MISMATCH, which is attached. I have also
    > > > attached the output of show hardware and the gdb bt output. I have
    > > > the core file and its matching VPP debs, which can be shared if
    > > > needed.
    > > >
    > > > In the backtrace, the incorrect value of index can be seen in
    > > > frame #5 (538976288 is 0x20202020, i.e. four repeated 0x20 bytes,
    > > > again pointing at overwritten memory):
    > > >
    > > > #5  0x00007fba88e9ce0b in vlib_increment_combined_counter (n_bytes=<optimized out>, n_packets=1, index=538976288, thread_index=0, cm=0x7fba481f46a0) at /root/build-1908/src/vlib/counter.h:229
    > > >
    > > > Thanks and Regards,
    > > >
    > > > Raj
    
