Sounds like one of the threads hasn't finished rebuilding its graph replica 
after you crank up the ipsec tunnel. The rebuild runs N-way parallel, with 
each thread rebuilding its own replica.

Check [i.e., ASSERT] that *vlib_worker_threads->node_reforks_required == 0 at 
the moment you enqueue packets from the brand-new ipsec tunnel.
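
To make the suggestion concrete, here is a minimal single-threaded model of the situation, not real vlib code. Every name in it (main_node_t, replica_runtime_t, main_add_next, worker_refork, safe_to_enqueue, and the reforks_required counter standing in for *vlib_worker_threads->node_reforks_required) is a hypothetical stand-in, and it assumes the main thread marks all replicas stale when it adds a next node and each worker clears its share when it has rebuilt its copy.

```c
#include <assert.h>
#include <stdbool.h>

#define N_WORKERS 2

/* Hypothetical stand-ins for illustration only, not the real vlib types. */
typedef struct { unsigned n_next_nodes; } main_node_t;       /* models vec_len (node->next_nodes) */
typedef struct { unsigned n_next_nodes; } replica_runtime_t; /* models node_runtime->n_next_nodes */

static main_node_t main_node = { 2 };
static replica_runtime_t replica[N_WORKERS] = { { 2 }, { 2 } };

/* Models *vlib_worker_threads->node_reforks_required (assumed semantics:
   number of workers that still have to rebuild their graph replica). */
static unsigned reforks_required = 0;

/* Main thread: add a next node (e.g. a new interface output node) and
   mark every worker replica as stale. */
static void
main_add_next (void)
{
  main_node.n_next_nodes++;
  reforks_required = N_WORKERS;
}

/* Worker w: rebuild its own replica from the main copy. */
static void
worker_refork (int w)
{
  assert (reforks_required > 0);
  replica[w].n_next_nodes = main_node.n_next_nodes;
  reforks_required--;
}

/* Guard to evaluate before enqueueing from worker w: all rebuilds are
   done and this worker's runtime count matches the main node. */
static bool
safe_to_enqueue (int w)
{
  return reforks_required == 0
    && replica[w].n_next_nodes == main_node.n_next_nodes;
}
```

In this model, calling main_add_next () leaves the main node with 3 next nodes while each replica still has 2, which is the 3 != 2 mismatch in the report; safe_to_enqueue () only becomes true once every worker has run worker_refork ().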

Dave

-----Original Message-----
From: [email protected] <[email protected]> On Behalf Of Christian Hopps
Sent: Tuesday, September 17, 2019 10:31 AM
To: vpp-dev <[email protected]>
Cc: [email protected]
Subject: [vpp-dev] Assert verifying node graph construction around ip4-arp.


Hi vpp-dev,

I'm hitting an assert in vlib_next_frame_change_ownership (vlib/main.c)

  ASSERT (vec_len (node->next_nodes) == node_runtime->n_next_nodes);

I added some code to see what was going on

  #ifdef CLIB_ASSERT_ENABLE
    if (vec_len (node->next_nodes) != node_runtime->n_next_nodes)
      {
        clib_warning ("%s: in %s: vec_len (node->next_nodes) %u != %u node_runtime->n_next_nodes)",
                      __FUNCTION__, node->name, vec_len (node->next_nodes),
                      node_runtime->n_next_nodes);
        for (int i = 0; i < vec_len (node->next_nodes); i++)
            clib_warning ("%s: next %u: %u %s", __FUNCTION__, i,
                          node->next_nodes[i],
                          vlib_get_node (vm, node->next_nodes[i])->name);
      }
  #endif
    ASSERT (vec_len (node->next_nodes) == node_runtime->n_next_nodes);

And this was the output:

  2019-09-17 12:21:18.375523: 1: vlib_next_frame_change_ownership:293: vlib_next_frame_change_ownership: in ip4-arp: vec_len (node->next_nodes) 3 != 2 node_runtime->n_next_nodes)
  2019-09-17 12:21:18.375600: 1: vlib_next_frame_change_ownership:296: vlib_next_frame_change_ownership: next 0: 595 error-drop
  2019-09-17 12:21:18.375630: 1: vlib_next_frame_change_ownership:296: vlib_next_frame_change_ownership: next 1: 611 UnknownEthernet1-output
  2019-09-17 12:21:18.375654: 1: vlib_next_frame_change_ownership:296: vlib_next_frame_change_ownership: next 2: 0 null-node
  2019-09-17 12:21:18.375678: 1: /var/build/mb-build/openwrt-dd/build_dir/target-aarch64_cortex-a53+neon-vfpv4_glibc-2.22/vpp-19.04.2/src/vlib/main.c:300 (vlib_next_frame_change_ownership) assertion `vec_len (node->next_nodes) == node_runtime->n_next_nodes' fails

The use case is that I've got an ipsec tunnel that's *just* been brought up 
(i.e., admin up). I'm immediately sending traffic on it in a worker thread 
(polling output routine) to the other end of the tunnel (and thus receiving 
from the other end which is doing the same thing). Depending on the order the 
tunnel endpoint interfaces are brought up I may or may not hit the above, but 
it happens most of the time on at least one endpoint.

In any case, is this hitting some sort of race condition with node/graph 
construction? I ask because I would not expect a node's next array to be 
larger than the node's runtime count of the same, except perhaps for some 
short period of time.

Thanks,
Chris.