Hi Dave,
Please excuse my delayed response. It took some time to recreate this issue.
I made changes to our process node as per your suggestion. Our process
node code now looks like this:

    while (1) {

        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length (event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                break;

            default: /* bug! */
                ASSERT (0);
        }
    }
After these changes we didn't observe any assertions, but we still hit the
process node suspend issue. This makes it clear that the node receives no
events other than the timeout.
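
A minimal sketch of how the stall can be confirmed from outside the node,
assuming the suspends counter reported by "show runtime" is sampled twice,
well over 10 ms apart. The helper name is hypothetical and not a VPP API; it
only captures the check we did by hand.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a VPP API): a process node that waits on a
 * 10 ms timer suspends again after every wakeup, so its suspends counter
 * must grow between two samples taken well over 10 ms apart. An unchanged
 * counter means the node was never resumed in between. */
static bool
process_node_looks_stalled (uint64_t suspends_before, uint64_t suspends_after)
{
  return suspends_after == suspends_before;
}
```

With the counter values from the "show runtime" outputs quoted later in this
thread, the working case (192246, then 193634) is not flagged, while the
issue case (81477, then 81477) is.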

In the issue state I collected the flags value of the vlib_process node
(rtb_vpp_epoll_process), and it seems to be correct (flags = 11).
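
As a reading aid, here is a small sketch that decodes this flags value. The
bit definitions are copied by hand and are an assumption about what
vlib/node.h contains in this VPP version, so please check them against your
tree. The helper name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed bit values, mirroring vlib/node.h; verify against your VPP tree. */
#define VLIB_PROCESS_IS_SUSPENDED_WAITING_FOR_CLOCK (1 << 0)
#define VLIB_PROCESS_IS_SUSPENDED_WAITING_FOR_EVENT (1 << 1)
#define VLIB_PROCESS_RESUME_PENDING (1 << 2)
#define VLIB_PROCESS_IS_RUNNING (1 << 3)

/* Hypothetical helper: write the names of all set bits into buf,
 * in bit order, each followed by a space. Returns snprintf's result. */
static int
format_process_flags (uint16_t flags, char *buf, size_t len)
{
  return snprintf (
    buf, len, "%s%s%s%s",
    (flags & VLIB_PROCESS_IS_SUSPENDED_WAITING_FOR_CLOCK) ? "WAITING_FOR_CLOCK " : "",
    (flags & VLIB_PROCESS_IS_SUSPENDED_WAITING_FOR_EVENT) ? "WAITING_FOR_EVENT " : "",
    (flags & VLIB_PROCESS_RESUME_PENDING) ? "RESUME_PENDING " : "",
    (flags & VLIB_PROCESS_IS_RUNNING) ? "IS_RUNNING " : "");
}
```

Under these assumed definitions, 11 decodes to waiting for clock and event
with the running bit set and no resume pending, which would be consistent
with the "any wait" state reported by show runtime.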

Please find the vlib_process_t and vlib_node_t data structure values
collected in the issue state below.

vlib_process_t:
============
$38 = {
  cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
  node_runtime = {
    cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
    function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
    errors = 0x7f9b3076a560,
    clocks_since_last_overflow = 0,
    max_clock = 3785970526,
    max_clock_n = 0,
    calls_since_last_overflow = 0,
    vectors_since_last_overflow = 0,
    next_frame_index = 1668,
    node_index = 437,
    input_main_loops_per_call = 0,
    main_loop_count_last_dispatch = 4147405645,
    main_loop_vector_stats = {0, 0},
    flags = 0,
    state = 0,
    n_next_nodes = 0,
    cached_next_index = 0,
    thread_index = 0,
    runtime_data = 0x7f9b2da503c6 ""
  },
  return_longjmp = {
    regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 94502640552512, 0, 140304430422032, 140306703608766}
  },
  resume_longjmp = {
    regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 94502640552512, 0, 140304161734272, 140304430441787}
  },
  flags = 11,
  log2_n_stack_bytes = 16,
  suspended_process_frame_index = 0,
  n_suspends = 0,
  pending_event_data_by_type_index = 0x7f9b307b8310,
  non_empty_event_type_bitmap = 0x7f9b307b8390,
  one_time_event_type_bitmap = 0x0,
  event_type_index_by_type_opaque = 0x7f9b2dab8bd8,
  event_type_pool = 0x7f9b2dcb5978,
  resume_clock_interval = 1000,
  stop_timer_handle = 3098,
  output_function = 0x0,
  output_function_arg = 0,
  stack = 0x7f9b1bb78000
}
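
A cross-check of resume_clock_interval = 1000 in the dump above, as a minimal
sketch. It assumes the wait time in seconds is scaled by 1e5 (10 us timer
ticks), which is how vlib_process_wait_for_event_or_clock appears to set this
field; treat the factor as an assumption and verify it in your tree. The
helper and macro names are hypothetical.

```c
#include <stdint.h>

/* Assumption: the process timer runs on 10 us ticks, so the wait time in
 * seconds is scaled by 1e5. Verify against vlib/node_funcs.h in your tree. */
#define ASSUMED_TICKS_PER_SECOND 1e5

/* Hypothetical helper: convert a wait time in seconds to timer ticks,
 * truncating toward zero like a plain C double-to-integer conversion. */
static uint64_t
wait_seconds_to_ticks (double dt)
{
  return (uint64_t) (dt * ASSUMED_TICKS_PER_SECOND);
}
```

Under that assumption, a 10 ms wait maps to 1000 ticks, so the value in the
dump is what a correctly armed 10 ms timer would look like.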

vlib_node_t
=========
 (gdb) p *n
$17 = {
  function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
  name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process",
  name_elog_string = 11783,
  stats_total = {
    calls = 0,
    vectors = 0,
    clocks = 1971244932732,
    suspends = 6847366,
    max_clock = 3785970526,
    max_clock_n = 0
  },
  stats_last_clear = {
    calls = 0,
    vectors = 0,
    clocks = 0,
    suspends = 0,
    max_clock = 0,
    max_clock_n = 0
  },
  type = VLIB_NODE_TYPE_PROCESS,
  index = 437,
  runtime_index = 40,
  runtime_data = 0x0,
  flags = 0,
  state = 0 '\000',
  runtime_data_bytes = 0 '\000',
  protocol_hint = 0 '\000',
  n_errors = 0,
  scalar_size = 0,
  vector_size = 0,
  error_heap_handle = 0,
  error_heap_index = 0,
  error_counters = 0x0,
  next_node_names = 0x7f9b3076a530,
  next_nodes = 0x0,
  sibling_of = 0x0,
  sibling_bitmap = 0x0,
  n_vectors_by_next_node = 0x0,
  next_slot_by_node = 0x0,
  prev_node_bitmap = 0x0,
  owner_node_index = 4294967295,
  owner_next_index = 4294967295,
  format_buffer = 0x0,
  unformat_buffer = 0x0,
  format_trace = 0x0,
  validate_frame = 0x0,
  state_string = 0x0,
  node_fn_registrations = 0x0
}

I added an assert statement just before the *VLIB_PROCESS_IS_RUNNING* flag
is cleared in the *dispatch_suspended_process* function, but this assert is
never hit.

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }
       p->flags &= ~VLIB_PROCESS_IS_RUNNING;
       pool_put_index (nm->suspended_process_frames,
                      p->suspended_process_frame_index);

I am not able to figure out why this process node stays suspended in some
scenarios. Can you please help me with some pointers to debug and
resolve this issue?

Hi Jinsh,
I applied your patch to my code. The issue is not solved with your patch.
Thank you for helping me out.

Thanks and Regards,
Sudhir


On Fri, Mar 3, 2023 at 12:53 PM Sudhir CR via lists.fd.io <sudhir=
rtbrick....@lists.fd.io> wrote:

> Hi Chetan,
> In our case we observe this issue only occasionally; the exact steps to
> recreate it are not known.
> I made changes to our process node as suggested by Dave, and with these
> changes I am trying to recreate the issue.
>
> Soon I will update my results and findings in this mail thread.
>
> Thanks and Regards,
> Sudhir
>
> On Fri, Mar 3, 2023 at 12:37 PM chetan bhasin <chetan.bhasin...@gmail.com>
> wrote:
>
>> Hi Sudhir,
>>
>> Is your issue resolved?
>>
>> Actually, we are facing the same issue on VPP 21.06.
>> In our case "api-rx-ring" is not getting called.
>> In our use case, workers call some functions in main-thread context,
>> which leads to RPC messages whose memory is allocated from the API segment.
>> As a result the API segment memory is fully used, which leads to a crash.
>>
>> Thanks,
>> Chetan
>>
>>
>> On Mon, Feb 20, 2023, 18:24 Sudhir CR via lists.fd.io <sudhir=
>> rtbrick....@lists.fd.io> wrote:
>>
>>> Hi Dave,
>>> Thank you very much for your inputs. I will try this out and get back to
>>> you with the results.
>>>
>>> Regards,
>>> Sudhir
>>>
>>> On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <v...@barachs.net> wrote:
>>>
>>>> Please try something like this, to eliminate the possibility that some
>>>> bit of code is sending this process an event. It’s not a good idea to skip
>>>> the vec_reset_length (event_data) step.
>>>>
>>>>
>>>>
>>>> while (1)
>>>> {
>>>>    uword event_type, * event_data = 0;
>>>>    int i;
>>>>
>>>>    vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);
>>>>
>>>>    event_type = vlib_process_get_events (vm, &event_data);
>>>>
>>>>    switch (event_type) {
>>>>    case ~0: /* handle timer expirations */
>>>>        rtb_event_loop_run_once ();
>>>>        break;
>>>>
>>>>    default: /* bug! */
>>>>        ASSERT (0);
>>>>    }
>>>>
>>>>    vec_reset_length(event_data);
>>>> }
>>>>
>>>>
>>>>
>>>> *From:* vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> *On Behalf Of *Sudhir
>>>> CR via lists.fd.io
>>>> *Sent:* Monday, February 20, 2023 4:02 AM
>>>> *To:* vpp-dev@lists.fd.io
>>>> *Subject:* Re: [vpp-dev] process node suspended indefinitely
>>>>
>>>>
>>>>
>>>> Hi Dave,
>>>> Thank you for your response and help.
>>>>
>>>>
>>>>
>>>> Please find the additional details below.
>>>>
>>>> VPP Version *21.10*
>>>>
>>>>
>>>> We are creating a process node, *rtb-vpp-epoll-process*, to handle
>>>> control plane events like interface add/delete and route add/delete.
>>>> This process node waits for *10ms* (it is not interested in any
>>>> events); once the 10ms have expired, it processes the control plane
>>>> events mentioned above.
>>>>
>>>> The code snippet looks like this:
>>>>
>>>>
>>>>
>>>> ```
>>>>
>>>> static uword
>>>> rtb_vpp_epoll_process (vlib_main_t                 *vm,
>>>>                        vlib_node_runtime_t  *rt,
>>>>                        vlib_frame_t         *f)
>>>> {
>>>>
>>>>     ...
>>>>     ...
>>>>     while (1) {
>>>>         vlib_process_wait_for_event_or_clock (vm, 10e-3);
>>>>         vlib_process_get_events (vm, NULL);
>>>>
>>>>         rtb_event_loop_run_once();   /* <---- control plane event handling */
>>>>     }
>>>> }
>>>> ```
>>>>
>>>> What we observed is that sometimes (when there is a high control plane
>>>> load, like a request to install more routes) "rtb-vpp-epoll-process" is
>>>> suspended and never scheduled again. We found this using "show runtime
>>>> rtb-vpp-epoll-process": in the issue state, the suspends counter in the
>>>> command output is not incrementing.
>>>>
>>>> *show runtime output in working case:*
>>>>
>>>> ```
>>>> DBGvpp# show runtime rtb-vpp-epoll-process
>>>>              Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
>>>> rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
>>>> DBGvpp#
>>>>
>>>> DBGvpp# show runtime rtb-vpp-epoll-process
>>>>              Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
>>>> rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
>>>> DBGvpp#
>>>> ```
>>>>
>>>>
>>>> *show runtime output in issue case:*
>>>>
>>>> ```
>>>> DBGvpp# show runtime rtb-vpp-epoll-process
>>>>              Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
>>>> rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
>>>>
>>>> DBGvpp# show runtime rtb-vpp-epoll-process
>>>>              Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
>>>> rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
>>>> ```
>>>>
>>>> Other process nodes like lldp-process, ip4-neighbor-age-process and
>>>> ip6-ra-process are running without any issue. Only the
>>>> "rtb-vpp-epoll-process" process node is suspended forever.
>>>>
>>>>
>>>>
>>>> Please let me know if any additional information is required.
>>>>
>>>> Hi Jinsh,
>>>> Thanks for pointing me to the issue you faced. The issue I am facing
>>>> looks similar.
>>>> I will verify with the given patch.
>>>>
>>>>
>>>> Thanks and Regards,
>>>>
>>>> Sudhir
>>>>
>>>>
>>>>
>>>> On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jins...@chinatelecom.cn>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have the same problem: the bfd process node stops running. I raised
>>>> this issue here:
>>>>
>>>> https://lists.fd.io/g/vpp-dev/message/22380
>>>> I think there is a problem with the process scheduling module when
>>>> using the time wheel.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> NOTICE TO RECIPIENT This e-mail message and any attachments are
>>>> confidential and may be privileged. If you received this e-mail in error,
>>>> any review, use, dissemination, distribution, or copying of this e-mail is
>>>> strictly prohibited. Please notify us immediately of the error by return
>>>> e-mail and please delete this message from your system. For more
>>>> information about Rtbrick, please visit us at www.rtbrick.com
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
> 
>
>

View/Reply Online (#22677): https://lists.fd.io/g/vpp-dev/message/22677