Re: [vpp-dev] VPP core dump with PPPoE

2019-10-18 Thread Neale Ranns via Lists.Fd.Io
Hi Raj,

Thanks for taking the time to investigate further. Some comments inline.

/neale

On 18/10/2019 11:27, "vpp-dev@lists.fd.io on behalf of Raj" wrote:

On Thu, Oct 17, 2019 at 8:45 PM Neale Ranns (nranns) wrote:
> I think your analysis is spot on. Thank you!
> Now you're a VPP PPPoE expert - do you have some suggestions on a fix?

Thanks for taking a look at this Neale!

I have some ideas for a fix to this problem.

Alternative 1.

At packet processing time, get the session information by
using adj->rewrite_header.sw_if_index. With this approach
fixup_data need not be passed.

+  pppoe_main_t *pem = &pppoe_main;
+
+  u32 sw_if_index = adj->rewrite_header.sw_if_index;
+  u32 session_id = pem->session_index_by_sw_if_index[sw_if_index];
+  t = pool_elt_at_index (pem->sessions, session_id);

Pros : Without any function prototype change we can arrive at the solution.
Cons : While processing every pppoe packet, we have to get the session data.

As you say, this is not very D-cache friendly...

Alternative 2.

pppoe build rewrite is done based on the rewrite
data (adj->rewrite_header). Here we have a
sw_if_index (adj->rewrite_header.sw_if_index). How is this different
from the session's encap_if_index? I tried to understand this from the
code, but couldn't figure out if both are the same.

If both are the same:
 The fix would be: at the time of processing a pppoe packet, just use
'adj->rewrite_header.sw_if_index'.

IIUC adj->rewrite_header.sw_if_index is the index of the PPPoE sub-interface, 
the action of the fixup is to swap the TX interface to the parent Ethernet - 
which is (session)->encap_if_index.

Alternative 3.

As Neale suggested, just push the session's encap_if_index instead of
passing the session pointer as the const void *.

But this requires some architectural changes, I believe, because const
void *fixup_data depends on the function pointer
'adj->sub_type.midchain.fixup_func'. This can be bound only at run
time.

The pointer passed as fixup_data must remain valid for the lifetime of the 
adjacency (which is the lifetime of the session). That's why I originally used 
the pointer to the session, but as you discovered it reallocs. I would be 
tempted to malloc the fixup data and attach it to the session; that way, when 
the session itself reallocs, the fixup data is unaffected.
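
A rough sketch of that idea (just a sketch: 'pppoe_fixup_data_t' and the
session's 'fixup' member are names I'm inventing here, and 'flags'/'rewrite'
stand for whatever pppoe_update_adj already passes):

  /* heap-allocate the fixup data once per session, so pool reallocs of
   * pem->sessions never move it */
  typedef struct
  {
    u32 encap_if_index;
  } pppoe_fixup_data_t;

  /* at session add */
  t->fixup = clib_mem_alloc (sizeof (pppoe_fixup_data_t));
  t->fixup->encap_if_index = t->encap_if_index;
  adj_nbr_midchain_update_rewrite (adj_index, pppoe_fixup, t->fixup,
                                   flags, rewrite);

  /* at session delete */
  clib_mem_free (t->fixup);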

I have one question regarding pool memory creation:

On each pppoe session addition VPP is resizing the vector pool
pem->sessions. Is it ok if I allocate a fixed number of sessions
initially?
I am trying to understand the architectural rationale behind adding one
element to the vector pool at a time, as opposed to allocating a bulk
number initially.
If it is ok to allocate a large pool initially (preferably taking the
max_sessions from a conf file), we can sidestep this issue.

Pools don't resize with each allocation; instead they grow to 3/2 times the 
current size each time they fill - this is managed by the underlying vector. 
You can alloc a large number up front, but then you have to play a balancing 
game of how many you need versus how much memory you waste. Plus if we don't 
solve this problem, then the pool can never grow and one would be limited to a 
fixed number of PPPoE sessions - which is not great.
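
For reference, a sketch of the usual pool idiom (vppinfra's pool_get /
pool_elt_at_index): the index is the realloc-proof handle, so long as you
never cache the element pointer:

  pppoe_session_t *t;
  u32 session_index;

  pool_get (pem->sessions, t);          /* may grow (and move) the pool */
  session_index = t - pem->sessions;    /* the index is stable across reallocs */
  /* ... later, always re-derive the pointer from the index ... */
  t = pool_elt_at_index (pem->sessions, session_index);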

/neale

Thanks and Regards,

Raj




Re: [vpp-dev] VPP core dump with PPPoE

2019-10-18 Thread Raj
On Thu, Oct 17, 2019 at 8:45 PM Neale Ranns (nranns) wrote:
> I think your analysis is spot on. Thank you!
> Now you're a VPP PPPoE expert - do you have some suggestions on a fix?

Thanks for taking a look at this Neale!

I have some ideas for a fix to this problem.

Alternative 1.

At packet processing time, get the session information by
using adj->rewrite_header.sw_if_index. With this approach
fixup_data need not be passed.

+  pppoe_main_t *pem = &pppoe_main;
+
+  u32 sw_if_index = adj->rewrite_header.sw_if_index;
+  u32 session_id = pem->session_index_by_sw_if_index[sw_if_index];
+  t = pool_elt_at_index (pem->sessions, session_id);

Pros : Without any function prototype change we can arrive at the solution.
Cons : While processing every pppoe packet, we have to get the session data.

Alternative 2.

pppoe build rewrite is done based on the rewrite
data (adj->rewrite_header). Here we have a
sw_if_index (adj->rewrite_header.sw_if_index). How is this different
from the session's encap_if_index? I tried to understand this from the
code, but couldn't figure out if both are the same.

If both are the same:
 The fix would be: at the time of processing a pppoe packet, just use
'adj->rewrite_header.sw_if_index'.

Alternative 3.

As Neale suggested, just push the session's encap_if_index instead of
passing the session pointer as the const void *.

But this requires some architectural changes, I believe, because const
void *fixup_data depends on the function pointer
'adj->sub_type.midchain.fixup_func'. This can be bound only at run
time.

I have one question regarding pool memory creation:

On each pppoe session addition VPP is resizing the vector pool
pem->sessions. Is it ok if I allocate a fixed number of sessions
initially?
I am trying to understand the architectural rationale behind adding one
element to the vector pool at a time, as opposed to allocating a bulk
number initially.
If it is ok to allocate a large pool initially (preferably taking the
max_sessions from a conf file), we can sidestep this issue.

Thanks and Regards,

Raj


Re: [vpp-dev] VPP core dump with PPPoE

2019-10-17 Thread Neale Ranns via Lists.Fd.Io

That will work.
Looking into the fixup function, all that's used from the session is the 
egress interface. So only that needs to be 'pushed'. I suspect using the whole 
session was a convenience for the original author, whoever that might have been 
...
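
Something like this, as a sketch (here the sw_if_index is encoded in the
fixup_data pointer itself, so nothing extra needs to stay allocated; 'flags'
and 'rewrite' stand for whatever pppoe_update_adj already passes):

  adj_nbr_midchain_update_rewrite (adj_index, pppoe_fixup,
                                   (const void *) (uword) t->encap_if_index,
                                   flags, rewrite);

  /* ... and in pppoe_fixup (), instead of reading t->encap_if_index: */
  vnet_buffer (b0)->sw_if_index[VLIB_TX] = (u32) (uword) data;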

/neale


On 17/10/2019 18:38, "Benoit Ganne (bganne)" wrote:

> I think your analysis is spot on. Thank you!
> Now you're a VPP PPPoE expert - do you have some suggestions on a fix?

Shouldn't the usual solution of storing the index in the pool instead of 
the pointer itself be enough?
I.e., use session_id as fixup_data, and use it in pppoe_fixup() to retrieve 
the session from the pool.

ben

> On 17/10/2019 16:18, "vpp-dev@lists.fd.io on behalf of Raj" <vpp-d...@lists.fd.io on behalf of rajlistu...@gmail.com> wrote:
> 
> Hello all,
> 
> I could do some more analysis of this issue and I think I have
> identified what could be wrong here.
> 
> During pppoe session creation time frame, as we mentioned earlier, the
> below function will get invoked from pppoe_update_adj()
> 
> adj_nbr_midchain_update_rewrite(adj_index_t
> adj_index,adj_midchain_fixup_t fixup,const void *fixup_data,...)
> 
> Note the 3rd arg 'fixup_data', here we are passing the current session
> address as fixup_data
> 
> When subsequent sessions are added this memory address will get
> altered as we are resizing the vector pool pem->sessions. Hence the
> above stored fixup_data (adj->sub_type.midchain.fixup_data) address is
> no longer valid, and that address could have been already freed.
> 
> I think that is the reason why we see memory corruption. It would be
> great if some one with far better knowledge of VPP internals could
> take a look at this and confirm this.
> 
> Thanks and Regards,
> 
> Raj
> 
> On Fri, Oct 11, 2019 at 12:17 PM Ni, Hongjun wrote:
> >
> > Hi Raj,
> >
> > I tried to reproduce your issue on VPP 20.01 as per your steps for
> some times, but cannot reproduce it.
> >
> > From your description, please set a breakpoint in
> vnet_pppoe_add_del_session().
> > And to see what happened when you create your second pppoe session
> with traffic in first pppoe session.
> >
> > Thanks,
> > Hongjun
> >
> > -----Original Message-----
> > From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of
> Raj
> > Sent: Monday, September 30, 2019 11:54 PM
> > To: vpp-dev 
> > Subject: Re: [vpp-dev] VPP core dump with PPPoE
> >
> > Hello all,
> >
> > I did some more debugging to find out when and where exactly the
> pppoe_session_t get corrupted. Added couple of log entries as shown below
> to log pppoe session id when a session is created as well as when packets
> from north traverses to south. I have tried this in VPP 19.08, 19.04 and
> 19.01 with same results.
> >
> > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 20923
> > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 35666
> > vnet [21892]: pppoe_fixup:169:  New_Packet  pppoe01 session id 35666
> > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 58191
> >
> > The sequence when corruption happens seems to be:
> >
> > 1. A new session is created
> > 2. A packet for the newly created session traverses from north to south
> > 3. Next session is created - and vpp crashes.
> >
> > Digging deeper, I added a watch for all newly created sessions using
> > the following gdb script
> >
> > b /root/build-1901/src/vnet/ip/ip4_forward.c:2444
> > commands 1
> >  watch -l ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
> >  bt
> > continue
> > end
> >
> > gdb, running with this script, bails out with following message:
> >
> > Thread 1 "vpp_main" hit Hardware watchpoint 2: -location
> ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
> >
> > Old value = 35666
> > New value = 4883
> > __memset_avx2_unaligned_erms () at
> > ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203

Re: [vpp-dev] VPP core dump with PPPoE

2019-10-17 Thread Benoit Ganne (bganne) via Lists.Fd.Io
> I think your analysis is spot on. Thank you!
> Now you're a VPP PPPoE expert - do you have some suggestions on a fix?

Shouldn't the usual solution of storing the index in the pool instead of the 
pointer itself be enough?
I.e., use session_id as fixup_data, and use it in pppoe_fixup() to retrieve the 
session from the pool.
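
As a sketch (assuming the fixup signature quoted elsewhere in this thread;
note it is the pool index, t - pem->sessions, that survives reallocs):

  /* at session add: pass the pool index, not the element pointer */
  adj_nbr_midchain_update_rewrite (adj_index, pppoe_fixup,
                                   (const void *) (uword) (t - pem->sessions),
                                   flags, rewrite);

  static void
  pppoe_fixup (vlib_main_t * vm, ip_adjacency_t * adj,
               vlib_buffer_t * b0, const void *data)
  {
    pppoe_main_t *pem = &pppoe_main;
    /* re-derive the element pointer per packet; indices stay valid */
    pppoe_session_t *t = pool_elt_at_index (pem->sessions, (uword) data);
    /* ... rest of the fixup unchanged ... */
  }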

ben

> On 17/10/2019 16:18, "vpp-dev@lists.fd.io on behalf of Raj" <vpp-d...@lists.fd.io on behalf of rajlistu...@gmail.com> wrote:
> 
> Hello all,
> 
> I could do some more analysis of this issue and I think I have
> identified what could be wrong here.
> 
> During pppoe session creation time frame, as we mentioned earlier, the
> below function will get invoked from pppoe_update_adj()
> 
> adj_nbr_midchain_update_rewrite(adj_index_t
> adj_index,adj_midchain_fixup_t fixup,const void *fixup_data,...)
> 
> Note the 3rd arg 'fixup_data', here we are passing the current session
> address as fixup_data
> 
> When subsequent sessions are added this memory address will get
> altered as we are resizing the vector pool pem->sessions. Hence the
> above stored fixup_data (adj->sub_type.midchain.fixup_data) address is
> no longer valid, and that address could have been already freed.
> 
> I think that is the reason why we see memory corruption. It would be
> great if some one with far better knowledge of VPP internals could
> take a look at this and confirm this.
> 
> Thanks and Regards,
> 
> Raj
> 
> On Fri, Oct 11, 2019 at 12:17 PM Ni, Hongjun wrote:
> >
> > Hi Raj,
> >
> > I tried to reproduce your issue on VPP 20.01 as per your steps for
> some times, but cannot reproduce it.
> >
> > From your description, please set a breakpoint in
> vnet_pppoe_add_del_session().
> > And to see what happened when you create your second pppoe session
> with traffic in first pppoe session.
> >
> > Thanks,
> > Hongjun
> >
> > -----Original Message-----
> > From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of
> Raj
> > Sent: Monday, September 30, 2019 11:54 PM
> > To: vpp-dev 
> > Subject: Re: [vpp-dev] VPP core dump with PPPoE
> >
> > Hello all,
> >
> > I did some more debugging to find out when and where exactly the
> pppoe_session_t get corrupted. Added couple of log entries as shown below
> to log pppoe session id when a session is created as well as when packets
> from north traverses to south. I have tried this in VPP 19.08, 19.04 and
> 19.01 with same results.
> >
> > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 20923
> > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 35666
> > vnet [21892]: pppoe_fixup:169:  New_Packet  pppoe01 session id 35666
> > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 58191
> >
> > The sequence when corruption happens seems to be:
> >
> > 1. A new session is created
> > 2. A packet for the newly created session traverses from north to south
> > 3. Next session is created - and vpp crashes.
> >
> > Digging deeper, I added a watch for all newly created sessions using
> > the following gdb script
> >
> > b /root/build-1901/src/vnet/ip/ip4_forward.c:2444
> > commands 1
> >  watch -l ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
> >  bt
> > continue
> > end
> >
> > gdb, running with this script, bails out with following message:
> >
> > Thread 1 "vpp_main" hit Hardware watchpoint 2: -location
> ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
> >
> > Old value = 35666
> > New value = 4883
> > __memset_avx2_unaligned_erms () at
> > ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
> > 203 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No
> > such file or directory.
> > (gdb)
> >
> > It is interesting to note that 4883 is 0x1313
> >
> > Back trace shows the path it took to reach here:
> >
> >
> > (gdb) bt
> > #0  __memset_avx2_unaligned_erms () at
> > ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
> > #1  0x761a4179 in mspace_put (msp=0x7fffb4df7010,
> > p_arg=0x7fffb592d9c8) at /root/build-
> 1901/src/vppinfra/dlmalloc.c:4294
> #2  0x7618ea39 in clib_mem_free

Re: [vpp-dev] VPP core dump with PPPoE

2019-10-17 Thread Neale Ranns via Lists.Fd.Io

Hi Raj,

I think your analysis is spot on. Thank you!
Now you're a VPP PPPoE expert - do you have some suggestions on a fix?

/neale

On 17/10/2019 16:18, "vpp-dev@lists.fd.io on behalf of Raj" wrote:

Hello all,

I could do some more analysis of this issue and I think I have
identified what could be wrong here.

During pppoe session creation time frame, as we mentioned earlier, the
below function will get invoked from pppoe_update_adj()

adj_nbr_midchain_update_rewrite (adj_index_t adj_index,
adj_midchain_fixup_t fixup, const void *fixup_data, ...)

Note the 3rd arg 'fixup_data'; here we are passing the current session's
address as fixup_data.

When subsequent sessions are added, this memory address will change as
we resize the vector pool pem->sessions. Hence the previously stored
fixup_data (adj->sub_type.midchain.fixup_data) address is no longer
valid, and that address could already have been freed.

I think that is the reason why we see memory corruption. It would be
great if someone with far better knowledge of VPP internals could
take a look at this and confirm it.

Thanks and Regards,

Raj

On Fri, Oct 11, 2019 at 12:17 PM Ni, Hongjun wrote:
>
> Hi Raj,
>
> I tried to reproduce your issue on VPP 20.01 as per your steps for some 
times, but cannot reproduce it.
>
> From your description, please set a breakpoint in 
vnet_pppoe_add_del_session().
> And to see what happened when you create your second pppoe session with 
traffic in first pppoe session.
>
> Thanks,
> Hongjun
>
> -----Original Message-----
> From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of Raj
> Sent: Monday, September 30, 2019 11:54 PM
    > To: vpp-dev 
> Subject: Re: [vpp-dev] VPP core dump with PPPoE
>
> Hello all,
>
> I did some more debugging to find out when and where exactly the 
pppoe_session_t get corrupted. Added couple of log entries as shown below to 
log pppoe session id when a session is created as well as when packets from 
north traverses to south. I have tried this in VPP 19.08, 19.04 and 19.01 with 
same results.
>
> vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 20923 
> vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 35666
> vnet [21892]: pppoe_fixup:169:  New_Packet  pppoe01 session id 35666
> vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 58191
>
> The sequence when corruption happens seems to be:
>
> 1. A new session is created
> 2. A packet for the newly created session traverses from north to south 
> 3. Next session is created - and vpp crashes.
>
> Digging deeper, I added watch for all newly created sessions using the 
following gdb script
>
> b /root/build-1901/src/vnet/ip/ip4_forward.c:2444
> commands 1
>  watch -l ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
>  bt
> continue
> end
>
> gdb, running with this script, bails out with following message:
>
> Thread 1 "vpp_main" hit Hardware watchpoint 2: -location 
((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
>
> Old value = 35666
> New value = 4883
> __memset_avx2_unaligned_erms () at
> ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
> 203 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No
> such file or directory.
> (gdb)
>
> It is interesting to note that 4883 is 0x1313
>
> Back trace shows the path it took to reach here:
>
>
> (gdb) bt
> #0  __memset_avx2_unaligned_erms () at
> ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
> #1  0x761a4179 in mspace_put (msp=0x7fffb4df7010,
> p_arg=0x7fffb592d9c8) at /root/build-1901/src/vppinfra/dlmalloc.c:4294
> #2  0x7618ea39 in clib_mem_free (p=0x7fffb592d9c8) at
> /root/build-1901/src/vppinfra/mem.h:215
> #3  0x7618edd8 in vec_resize_allocate_memory (v=0x7fffb592da00, 
length_increment=1, data_bytes=312, header_bytes=56, data_align=64) at
> /root/build-1901/src/vppinfra/vec.c:96
> #4  0x7fffb0aa4a29 in _vec_resize_inline (v=0x7fffb592da00, 
length_increment=1, data_bytes=256, header_bytes=48, data_align=64) at
> /root/build-1901/src/vppinfra/vec.h:147
> #5  0x7fffb0aa9ca4 in vnet_pppoe_add_del_session (a=0x7fffb6703950, 
sw_if_indexp=0x7fffb67038e8) at
> /root/build-1901/src/plugins/pppoe/pppoe.c:335
> #6  0x7fffb0aaadec in pppoe_add_del_session_command_fn
> (vm=0x768e3400 , input=0x7fffb6703ee0,

Re: [vpp-dev] VPP core dump with PPPoE

2019-10-17 Thread Raj
Hello all,

I could do some more analysis of this issue and I think I have
identified what could be wrong here.

During pppoe session creation time frame, as we mentioned earlier, the
below function will get invoked from pppoe_update_adj()

adj_nbr_midchain_update_rewrite (adj_index_t adj_index,
adj_midchain_fixup_t fixup, const void *fixup_data, ...)

Note the 3rd arg 'fixup_data'; here we are passing the current session's
address as fixup_data.

When subsequent sessions are added, this memory address will change as
we resize the vector pool pem->sessions. Hence the previously stored
fixup_data (adj->sub_type.midchain.fixup_data) address is no longer
valid, and that address could already have been freed.

I think that is the reason why we see memory corruption. It would be
great if someone with far better knowledge of VPP internals could
take a look at this and confirm it.

Thanks and Regards,

Raj

On Fri, Oct 11, 2019 at 12:17 PM Ni, Hongjun wrote:
>
> Hi Raj,
>
> I tried to reproduce your issue on VPP 20.01 as per your steps for some 
> times, but cannot reproduce it.
>
> From your description, please set a breakpoint in 
> vnet_pppoe_add_del_session().
> And to see what happened when you create your second pppoe session with 
> traffic in first pppoe session.
>
> Thanks,
> Hongjun
>
> -----Original Message-----
> From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of Raj
> Sent: Monday, September 30, 2019 11:54 PM
> To: vpp-dev 
> Subject: Re: [vpp-dev] VPP core dump with PPPoE
>
> Hello all,
>
> I did some more debugging to find out when and where exactly the 
> pppoe_session_t get corrupted. Added couple of log entries as shown below to 
> log pppoe session id when a session is created as well as when packets from 
> north traverses to south. I have tried this in VPP 19.08, 19.04 and 19.01 
> with same results.
>
> vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 20923
> vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 35666
> vnet [21892]: pppoe_fixup:169:  New_Packet  pppoe01 session id 35666
> vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 58191
>
> The sequence when corruption happens seems to be:
>
> 1. A new session is created
> 2. A packet for the newly created session traverses from north to south
> 3. Next session is created - and vpp crashes.
>
> Digging deeper, I added watch for all newly created sessions using the 
> following gdb script
>
> b /root/build-1901/src/vnet/ip/ip4_forward.c:2444
> commands 1
>  watch -l ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
>  bt
> continue
> end
>
> gdb, running with this script, bails out with following message:
>
> Thread 1 "vpp_main" hit Hardware watchpoint 2: -location 
> ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
>
> Old value = 35666
> New value = 4883
> __memset_avx2_unaligned_erms () at
> ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
> 203 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No
> such file or directory.
> (gdb)
>
> It is interesting to note that 4883 is 0x1313
>
> Back trace shows the path it took to reach here:
>
>
> (gdb) bt
> #0  __memset_avx2_unaligned_erms () at
> ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
> #1  0x761a4179 in mspace_put (msp=0x7fffb4df7010,
> p_arg=0x7fffb592d9c8) at /root/build-1901/src/vppinfra/dlmalloc.c:4294
> #2  0x7618ea39 in clib_mem_free (p=0x7fffb592d9c8) at
> /root/build-1901/src/vppinfra/mem.h:215
> #3  0x7618edd8 in vec_resize_allocate_memory (v=0x7fffb592da00, 
> length_increment=1, data_bytes=312, header_bytes=56, data_align=64) at
> /root/build-1901/src/vppinfra/vec.c:96
> #4  0x7fffb0aa4a29 in _vec_resize_inline (v=0x7fffb592da00, 
> length_increment=1, data_bytes=256, header_bytes=48, data_align=64) at
> /root/build-1901/src/vppinfra/vec.h:147
> #5  0x7fffb0aa9ca4 in vnet_pppoe_add_del_session (a=0x7fffb6703950, 
> sw_if_indexp=0x7fffb67038e8) at
> /root/build-1901/src/plugins/pppoe/pppoe.c:335
> #6  0x7fffb0aaadec in pppoe_add_del_session_command_fn
> (vm=0x768e3400 , input=0x7fffb6703ee0,
> cmd=0x7fffb65dd73c) at /root/build-1901/src/plugins/pppoe/pppoe.c:554
> #7  0x76617db0 in vlib_cli_dispatch_sub_commands
> (vm=0x768e3400 , cm=0x768e3600 
> , input=0x7fffb6703ee0, parent_command_index=21) at 
> /root/build-1901/src/vlib/cli.c:644
>
> This do not occur if traffic is not initiated, and there is no packet flow 
> through the system. It would be great if some one who understands this code 
> to confirm if my analysis is correct and give some pointers to figure out
>
>

Re: [vpp-dev] VPP core dump with PPPoE

2019-10-11 Thread Ni, Hongjun
Hi Raj,

I tried to reproduce your issue on VPP 20.01 as per your steps several times, 
but could not reproduce it.

From your description, please set a breakpoint in vnet_pppoe_add_del_session()
and see what happens when you create your second pppoe session while traffic 
is flowing on the first pppoe session.

Thanks,
Hongjun

-----Original Message-----
From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of Raj
Sent: Monday, September 30, 2019 11:54 PM
To: vpp-dev 
Subject: Re: [vpp-dev] VPP core dump with PPPoE

Hello all,

I did some more debugging to find out when and where exactly the 
pppoe_session_t gets corrupted. I added a couple of log entries, shown below, 
to log the pppoe session id when a session is created as well as when packets 
traverse from north to south. I have tried this in VPP 19.08, 19.04 and 19.01 
with the same results.

vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 20923
vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 35666
vnet [21892]: pppoe_fixup:169:  New_Packet  pppoe01 session id 35666
vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 58191

The sequence when corruption happens seems to be:

1. A new session is created
2. A packet for the newly created session traverses from north to south
3. Next session is created - and vpp crashes.

Digging deeper, I added a watch for all newly created sessions using the 
following gdb script

b /root/build-1901/src/vnet/ip/ip4_forward.c:2444
commands 1
 watch -l ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
 bt
continue
end

gdb, running with this script, bails out with the following message:

Thread 1 "vpp_main" hit Hardware watchpoint 2: -location 
((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id

Old value = 35666
New value = 4883
__memset_avx2_unaligned_erms () at
../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
203 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No
such file or directory.
(gdb)

It is interesting to note that 4883 is 0x1313

Back trace shows the path it took to reach here:


(gdb) bt
#0  __memset_avx2_unaligned_erms () at
../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
#1  0x761a4179 in mspace_put (msp=0x7fffb4df7010,
p_arg=0x7fffb592d9c8) at /root/build-1901/src/vppinfra/dlmalloc.c:4294
#2  0x7618ea39 in clib_mem_free (p=0x7fffb592d9c8) at
/root/build-1901/src/vppinfra/mem.h:215
#3  0x7618edd8 in vec_resize_allocate_memory (v=0x7fffb592da00, 
length_increment=1, data_bytes=312, header_bytes=56, data_align=64) at
/root/build-1901/src/vppinfra/vec.c:96
#4  0x7fffb0aa4a29 in _vec_resize_inline (v=0x7fffb592da00, 
length_increment=1, data_bytes=256, header_bytes=48, data_align=64) at
/root/build-1901/src/vppinfra/vec.h:147
#5  0x7fffb0aa9ca4 in vnet_pppoe_add_del_session (a=0x7fffb6703950, 
sw_if_indexp=0x7fffb67038e8) at
/root/build-1901/src/plugins/pppoe/pppoe.c:335
#6  0x7fffb0aaadec in pppoe_add_del_session_command_fn
(vm=0x768e3400 , input=0x7fffb6703ee0,
cmd=0x7fffb65dd73c) at /root/build-1901/src/plugins/pppoe/pppoe.c:554
#7  0x76617db0 in vlib_cli_dispatch_sub_commands
(vm=0x768e3400 , cm=0x768e3600 
, input=0x7fffb6703ee0, parent_command_index=21) at 
/root/build-1901/src/vlib/cli.c:644

This does not occur if traffic is not initiated and there is no packet flow 
through the system. It would be great if someone who understands this code 
could confirm my analysis and give some pointers to figure out

1. Why, when a new session is created, the data of an old session is changed to
0x1313
2. What debugging steps can I take next to figure out why this happens.



Thanks and Regards,


Raj

On Sat, Sep 28, 2019 at 6:09 PM Raj via Lists.Fd.Io wrote:
>
> Hello all,
>
> I have done some more tests to pinpoint the exact condition of the 
> crash. What I could figure out was that the crash happens when memory 
> is being allocated for pppoe_session_t while packets are flowing 
> through pppoe interface.
>
> Here is what I did to arrive at this conclusion:
>
> 1. Configure VPP without any default route (to ensure packets do not
> hit north interface from south)
> 2. Provision 100 PPPoE clients - No crash observed
> 3. Deprovision all 100 PPPoE clients
> 4. Configure default route
> 5. Provision 100 PPPoE clients again, and start a ping to an external
> IP from each client - No Crash observed
> 6. Provision 50 more PPPoE clients - VPP crashes.
>
> Based on this test, and from what I could understand from the code, my 
> guess is  that there is some memory corruption happening inside the 
> pppoe_session_t when memory is being allocated for it when there is 
> packets traversing through PPPoE interface.
>
> Thanks and Regards,
>
> Raj
>
>
> On Thu, Sep 26, 2019 at 7:15 PM Raj via Lists.Fd.Io wrote:
> >
> > Hello all,

Re: [vpp-dev] VPP core dump with PPPoE

2019-09-30 Thread Raj
Hello all,

I did some more debugging to find out when and where exactly the
pppoe_session_t gets corrupted. I added a couple of log entries, shown
below, to log the pppoe session id when a session is created as well as
when packets traverse from north to south. I have tried this in VPP
19.08, 19.04 and 19.01 with the same results.

vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 20923
vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 35666
vnet [21892]: pppoe_fixup:169:  New_Packet  pppoe01 session id 35666
vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 58191

The sequence when corruption happens seems to be:

1. A new session is created
2. A packet for the newly created session traverses from north to south
3. Next session is created - and vpp crashes.

Digging deeper, I added a watch for all newly created sessions using the
following gdb script

b /root/build-1901/src/vnet/ip/ip4_forward.c:2444
commands 1
 watch -l ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
 bt
continue
end

gdb, running with this script, bails out with the following message:

Thread 1 "vpp_main" hit Hardware watchpoint 2: -location
((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id

Old value = 35666
New value = 4883
__memset_avx2_unaligned_erms () at
../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
203 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No
such file or directory.
(gdb)

It is interesting to note that 4883 is 0x1313

Back trace shows the path it took to reach here:


(gdb) bt
#0  __memset_avx2_unaligned_erms () at
../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
#1  0x761a4179 in mspace_put (msp=0x7fffb4df7010,
p_arg=0x7fffb592d9c8) at /root/build-1901/src/vppinfra/dlmalloc.c:4294
#2  0x7618ea39 in clib_mem_free (p=0x7fffb592d9c8) at
/root/build-1901/src/vppinfra/mem.h:215
#3  0x7618edd8 in vec_resize_allocate_memory
(v=0x7fffb592da00, length_increment=1, data_bytes=312,
header_bytes=56, data_align=64) at
/root/build-1901/src/vppinfra/vec.c:96
#4  0x7fffb0aa4a29 in _vec_resize_inline (v=0x7fffb592da00,
length_increment=1, data_bytes=256, header_bytes=48, data_align=64) at
/root/build-1901/src/vppinfra/vec.h:147
#5  0x7fffb0aa9ca4 in vnet_pppoe_add_del_session
(a=0x7fffb6703950, sw_if_indexp=0x7fffb67038e8) at
/root/build-1901/src/plugins/pppoe/pppoe.c:335
#6  0x7fffb0aaadec in pppoe_add_del_session_command_fn
(vm=0x768e3400 , input=0x7fffb6703ee0,
cmd=0x7fffb65dd73c) at /root/build-1901/src/plugins/pppoe/pppoe.c:554
#7  0x76617db0 in vlib_cli_dispatch_sub_commands
(vm=0x768e3400 , cm=0x768e3600
, input=0x7fffb6703ee0, parent_command_index=21)
at /root/build-1901/src/vlib/cli.c:644

This does not occur if traffic is not initiated, and there is no packet
flow through the system. It would be great if someone who understands
this code could confirm my analysis and give some pointers
to figure out

1. Why, when a new session is created, the data of an old session is
changed to 0x1313
2. What debugging steps can I take next to figure out why this happens.



Thanks and Regards,


Raj

On Sat, Sep 28, 2019 at 6:09 PM Raj via Lists.Fd.Io wrote:
>
> Hello all,
>
> I have done some more tests to pinpoint the exact condition of the
> crash. What I could figure out was that the crash happens when memory
> is being allocated for pppoe_session_t while packets are flowing
> through pppoe interface.
>
> Here is what I did to arrive at this conclusion:
>
> 1. Configure VPP without any default route (to ensure packets do not
> hit north interface from south)
> 2. Provision 100 PPPoE clients - No crash observed
> 3. Deprovision all 100 PPPoE clients
> 4. Configure default route
> 5. Provision 100 PPPoE clients again, and start a ping to an external
> IP from each client - No Crash observed
> 6. Provision 50 more PPPoE clients - VPP crashes.
>
> Based on this test, and from what I could understand from the code, my
> guess is  that there is some memory corruption happening inside the
> pppoe_session_t when memory is being allocated for it when there is
> packets traversing through PPPoE interface.
>
> Thanks and Regards,
>
> Raj
>
>
> On Thu, Sep 26, 2019 at 7:15 PM Raj via Lists.Fd.Io wrote:
> >
> > Hello all,
> >
> > I am observing a VPP crash when approximately 20 - 50 PPPoE clients
> > are connecting and traffic is flowing through them. This crash was
> > reproducible every time I tried.
> >
> > I did some debugging and here is what I could find out so far:
> >
> > If I understand correctly, when a incoming packet from north side is
> > being sent to PPPoE interface, pppoe_fixup() is called to update
> > pppoe0->length, and t->encap_if_index. Length and encap_if_index is
> > taken from adj0->sub_type.midchain.fixup_data
> >
> > My observation is that while clients are connecting and traffic is
> > flowing for connected clients, 

Re: [vpp-dev] VPP core dump with PPPoE

2019-09-28 Thread Raj
Hello all,

I have done some more tests to pinpoint the exact condition of the
crash. What I could figure out was that the crash happens when memory
is being allocated for a pppoe_session_t while packets are flowing
through a pppoe interface.

Here is what I did to arrive at this conclusion:

1. Configure VPP without any default route (to ensure packets do not
hit north interface from south)
2. Provision 100 PPPoE clients - No crash observed
3. Deprovision all 100 PPPoE clients
4. Configure default route
5. Provision 100 PPPoE clients again, and start a ping to an external
IP from each client - No Crash observed
6. Provision 50 more PPPoE clients - VPP crashes.

Based on this test, and from what I could understand from the code, my
guess is that there is some memory corruption happening inside the
pppoe_session_t when memory is being allocated for it while packets
are traversing the PPPoE interface.

Thanks and Regards,

Raj


On Thu, Sep 26, 2019 at 7:15 PM Raj via Lists.Fd.Io wrote:
>
> Hello all,
>
> I am observing a VPP crash when approximately 20 - 50 PPPoE clients
> are connecting and traffic is flowing through them. This crash was
> reproducible every time I tried.
>
> I did some debugging and here is what I could find out so far:
>
> If I understand correctly, when a incoming packet from north side is
> being sent to PPPoE interface, pppoe_fixup() is called to update
> pppoe0->length, and t->encap_if_index. Length and encap_if_index is
> taken from adj0->sub_type.midchain.fixup_data
>
> My observation is that while clients are connecting and traffic is
> flowing for connected clients, adj0->sub_type.midchain.fixup_data
> appears to hold incorrect data, at some point in time, during the
> test. What we have seen is the incorrect data
> (adj0->sub_type.midchain.fixup_data) is observed for clients which are
> already provisioned for some time and which had packets flowing
> through them.
>
> I figured this out by using gdb and inspecting
> adj0->sub_type.midchain.fixup_data, after typecasting it into
> pppoe_session_t
>
> In the structure, I could see that session_id, client_ip and encap_idx
> are incorrect. I did not check other values in the structure.
>
> I also added code to log this fields in pppoe_fixup() and logs too
> shows incorrect data in the fields.
>
> Example logs taken just before crash:
>
> > vnet[12988]: pppoe_fixup:243: 40:7b:1b: 0:12:38 ->  2:42: a: 1: 0: 2 , type 8864
> vnet[12988]: pppoe_fixup:271: pppoe session id 4883, client_ip
> 0x13131313 encap idx 0x13131313
>
> First log prints out packet headers, to verify that data in packet is
> as expected and is correct. Second log prints values in pppoe_session
> data, and it can be seen that the values are obviously incorrect. At
> this point the packet is sent out through the south interface. Again
> after some time the TX index values become some thing similar to
> 1422457436 and VPP core dumps.
>
> We have tested the following scenarios:
>
> 1. Add PPPoE clients without sending out any traffic: There is no
> crash observed.
> 2. Add n number of PPPoE clients, load traffic [No adding or removal
> or clients while traffic is on, see next scenario]: There is no crash
> observed
> 3. Load traffic as soon as each client connects: VPP crash observed.
>
> Another observation is that  encap_if_index is available in two places
> inside pppoe_fixup:
>
> 1. adj->rewrite_header.sw_if_index
> 2. t->encap_if_index
>
> t->encap_if_index is used for updating TX, and this gets corrupted,
> while adj->rewrite_header.sw_if_index has the correct index.
>
> I can check and get back if you need any additional information. Let
> me know if a bug report is to be created for this.
>
> Environment:
>
> vpp# show version verbose
> Version:  v19.08.1-59~ga2aa83ca9-dirty
> Compiled by:  root
> Compile host:  build-02
> Compile date:  Thu Sep 26 16:44:00 IST 2019
> Compile location:  /root/build-1908
> Compiler:  GCC 7.4.0
> Current PID:  7802
>
> Operating system: Ubuntu 18.04 amd64
>
> startup.conf and associated exec file is attached.
>
> There is a small patch to stock VPP to disable
> ETHERNET_ERROR_L3_MAC_MISMATCH, which is attached. I have also
> attached output of show show hardware and gdb bt output. I have the
> core file and its matching VPP debs, and can be shared if needed.
>
> In the bt the incorrect value of index can be seen in bt #5:
>
> #5  0x7fba88e9ce0b in vlib_increment_combined_counter
> (n_bytes=, n_packets=1, index=538976288,
> thread_index=0, cm=0x7fba481f46a0) at
> /root/build-1908/src/vlib/counter.h:229
>
> Thanks and Regards,
>
> Raj

[vpp-dev] VPP core dump with PPPoE

2019-09-26 Thread Raj
Hello all,

I am observing a VPP crash when approximately 20 - 50 PPPoE clients
are connecting and traffic is flowing through them. This crash was
reproducible every time I tried.

I did some debugging and here is what I could find out so far:

If I understand correctly, when an incoming packet from the north side
is being sent to a PPPoE interface, pppoe_fixup() is called to update
pppoe0->length and t->encap_if_index. The length and encap_if_index are
taken from adj0->sub_type.midchain.fixup_data.
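
For context, a simplified sketch of that fixup path as I understand it (not
the exact plugin code, but close to what pppoe.c does):

  static void
  pppoe_fixup (vlib_main_t * vm, ip_adjacency_t * adj,
               vlib_buffer_t * b0, const void *data)
  {
    /* 'data' is the session pointer stored when the adjacency was built */
    const pppoe_session_t *t = data;
    u8 *p = vlib_buffer_get_current (b0);
    pppoe_header_t *pppoe0 = (pppoe_header_t *) (p + sizeof (ethernet_header_t));

    /* patch the PPPoE payload length from the buffer ... */
    pppoe0->length = clib_host_to_net_u16 (vlib_buffer_length_in_chain (vm, b0)
                                           - sizeof (pppoe_header_t)
                                           - sizeof (ethernet_header_t));

    /* ... and swap TX to the encap (parent Ethernet) interface */
    vnet_buffer (b0)->sw_if_index[VLIB_TX] = t->encap_if_index;
  }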

My observation is that while clients are connecting and traffic is
flowing for connected clients, adj0->sub_type.midchain.fixup_data
appears to hold incorrect data at some point during the test. What we
have seen is that the incorrect data
(adj0->sub_type.midchain.fixup_data) is observed for clients which have
already been provisioned for some time and which had packets flowing
through them.

I figured this out by using gdb and inspecting
adj0->sub_type.midchain.fixup_data after typecasting it into
pppoe_session_t.

In the structure, I could see that session_id, client_ip and encap_idx
are incorrect. I did not check other values in the structure.

I also added code to log these fields in pppoe_fixup(), and the logs
too show incorrect data in the fields.

Example logs taken just before crash:

vnet[12988]: pppoe_fixup:243: 40:7b:1b: 0:12:38 ->  2:42: a: 1: 0: 2 , type 8864
vnet[12988]: pppoe_fixup:271: pppoe session id 4883, client_ip
0x13131313 encap idx 0x13131313

The first log prints the packet headers, to verify that the data in the
packet is as expected and correct. The second log prints values from the
pppoe_session data, and it can be seen that the values are obviously
incorrect. At this point the packet is sent out through the south
interface. Again, after some time, the TX index values become something
like 1422457436 and VPP core dumps.

We have tested the following scenarios:

1. Add PPPoE clients without sending out any traffic: There is no
crash observed.
2. Add n number of PPPoE clients, then load traffic [No adding or removal
of clients while traffic is on, see next scenario]: There is no crash
observed.
3. Load traffic as soon as each client connects: VPP crash observed.

Another observation is that encap_if_index is available in two places
inside pppoe_fixup:

1. adj->rewrite_header.sw_if_index
2. t->encap_if_index

t->encap_if_index is used for updating TX, and this gets corrupted,
while adj->rewrite_header.sw_if_index has the correct index.

I can check and get back if you need any additional information. Let
me know if a bug report should be created for this.

Environment:

vpp# show version verbose
Version:  v19.08.1-59~ga2aa83ca9-dirty
Compiled by:  root
Compile host:  build-02
Compile date:  Thu Sep 26 16:44:00 IST 2019
Compile location:  /root/build-1908
Compiler:  GCC 7.4.0
Current PID:  7802

Operating system: Ubuntu 18.04 amd64

startup.conf and associated exec file is attached.

There is a small patch to stock VPP to disable
ETHERNET_ERROR_L3_MAC_MISMATCH, which is attached. I have also
attached output of show show hardware and gdb bt output. I have the
core file and its matching VPP debs, and can be shared if needed.

In the bt the incorrect value of index can be seen in bt #5:

#5  0x7fba88e9ce0b in vlib_increment_combined_counter
(n_bytes=, n_packets=1, index=538976288,
thread_index=0, cm=0x7fba481f46a0) at
/root/build-1908/src/vlib/counter.h:229

Thanks and Regards,

Raj
vpp# show version verbose
Version:  v19.08.1-59~ga2aa83ca9-dirty
Compiled by:  root
Compile host: build-02
Compile date: Thu Sep 26 16:44:00 IST 2019
Compile location: /root/build-1908
Compiler: GCC 7.4.0
Current PID:  7802
vpp# show hardware
  NameIdx   Link  Hardware
GigabitEthernet3/0/1   1 up   GigabitEthernet3/0/1
  Link speed: 1 Gbps
  Ethernet address 40:7b:1b:00:12:33
  Intel e1000
carrier up full duplex mtu 9206
flags: pmd maybe-multiseg tx-offload intel-phdr-cksum rx-ip4-cksum
rx: queues 1 (max 8), desc 1024 (min 32 max 4096 align 8)
tx: queues 1 (max 8), desc 1024 (min 32 max 4096 align 8)
pci: device 8086:1521 subsystem 15bb:0008 address :03:00.01 numa 0
max rx packet len: 16383
promiscuous: unicast off all-multicast off
vlan offload: strip off filter off qinq off
rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum vlan-filter
   jumbo-frame scatter keep-crc
rx offload active: ipv4-cksum jumbo-frame scatter
tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum sctp-cksum
   tcp-tso multi-segs
tx offload active: udp-cksum tcp-cksum multi-segs
rss avail: ipv4-tcp ipv4-udp ipv4 ipv6-tcp-ex ipv6-udp-ex ipv6-tcp
   ipv6-udp ipv6-ex ipv6
rss active:none
tx burst function: eth_igb_xmit_pkts
rx burst