Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-13 Thread Gilles Chanteperdrix
On 07/12/2011 07:43 PM, Jan Kiszka wrote:
 On 2011-07-12 19:38, Gilles Chanteperdrix wrote:
 On 07/12/2011 07:34 PM, Jan Kiszka wrote:
 On 2011-07-12 19:31, Gilles Chanteperdrix wrote:
 On 07/12/2011 02:57 PM, Jan Kiszka wrote:
   xnlock_put_irqrestore(&nklock, s);
   xnpod_schedule();
   }
 @@ -1036,6 +1043,7 @@ redo:
* to process this signal anyway.
*/
   if (rthal_current_domain == rthal_root_domain) {
 + XENO_BUGON(NUCLEUS, xnthread_test_info(thread, XNATOMIC));

 Misleading dead code again, XNATOMIC is cleared not ten lines above.

 Nope, I forgot to remove that line.


   if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
       || this_task->state != TASK_RUNNING))
   xnpod_fatal
 @@ -1044,6 +1052,8 @@ redo:
   return -ERESTARTSYS;
   }
  
 + xnthread_clear_info(thread, XNATOMIC);

  Why this? I find the xnthread_clear_info(XNATOMIC) to be in exactly the
  right place where it currently is.

 Nope. Now we either clear XNATOMIC after successful migration or when
 the signal is about to be sent (ie. in the hook). That way we can test
 more reliably (TM) in the gatekeeper if the thread can be migrated.

  Ok for adding the XNATOMIC test, because it improves the robustness, but
  why change the way XNATOMIC is set and cleared? The chances of breaking
  things while changing code in this area are really high...
 
  The current code is (most probably) broken as it does not properly
  synchronize the gatekeeper against a signaled and runaway target
  Linux task.
 
  We need an indication of whether a Linux signal will wake up (or has
  already woken up) the to-be-migrated task. That task may have continued
  running over its Linux context, potentially on a different CPU. Providing
  this indication is the purpose of changing where XNATOMIC is cleared.

What about synchronizing with the gatekeeper using a semaphore, as done
in the first patch you sent, but doing it in xnshadow_harden, as soon as
we detect that we are not back from schedule() in primary mode? It seems
this would avoid any further issue, as we would then be guaranteed that
the thread could not switch to TASK_INTERRUPTIBLE again before the
gatekeeper is finished.
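
Something along these lines, as a sketch only (gk_done is a made-up name
for whatever semaphore we would end up using, it is not an existing field,
and the exact locals differ in the real xnshadow_harden):

	/* xnshadow_harden(), after schedule() has returned: */
	if (rthal_current_domain == rthal_root_domain) {
		/*
		 * A Linux signal woke us up before the gatekeeper could
		 * migrate us. Block here until the gatekeeper has
		 * finished processing our request, so that it can never
		 * observe us back in TASK_INTERRUPTIBLE while it is
		 * still working on the migration.
		 */
		down(&sched->gk_done);	/* gatekeeper calls up() when done */
		return -ERESTARTSYS;
	}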

What worries me is the comment in xnshadow_harden:

 * gatekeeper sent us to primary mode. Since
 * TASK_UNINTERRUPTIBLE is unavailable to us without wrecking
 * the runqueue's count of uniniterruptible tasks, we just
 * notice the issue and gracefully fail; the caller will have
 * to process this signal anyway.
 */

Does this mean that we can not switch to TASK_UNINTERRUPTIBLE at this
point? Or simply that TASK_UNINTERRUPTIBLE is not available for the
business of xnshadow_harden?

-- 
Gilles.



Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-13 Thread Jan Kiszka
On 2011-07-13 20:39, Gilles Chanteperdrix wrote:
 On 07/12/2011 07:43 PM, Jan Kiszka wrote:
 On 2011-07-12 19:38, Gilles Chanteperdrix wrote:
 On 07/12/2011 07:34 PM, Jan Kiszka wrote:
 On 2011-07-12 19:31, Gilles Chanteperdrix wrote:
 On 07/12/2011 02:57 PM, Jan Kiszka wrote:
  xnlock_put_irqrestore(&nklock, s);
  xnpod_schedule();
  }
 @@ -1036,6 +1043,7 @@ redo:
   * to process this signal anyway.
   */
  if (rthal_current_domain == rthal_root_domain) {
  +	XENO_BUGON(NUCLEUS, xnthread_test_info(thread, XNATOMIC));

 Misleading dead code again, XNATOMIC is cleared not ten lines above.

 Nope, I forgot to remove that line.


  if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
      || this_task->state != TASK_RUNNING))
  xnpod_fatal
 @@ -1044,6 +1052,8 @@ redo:
  return -ERESTARTSYS;
  }
  
 +xnthread_clear_info(thread, XNATOMIC);

  Why this? I find the xnthread_clear_info(XNATOMIC) to be in exactly the
  right place where it currently is.

 Nope. Now we either clear XNATOMIC after successful migration or when
 the signal is about to be sent (ie. in the hook). That way we can test
 more reliably (TM) in the gatekeeper if the thread can be migrated.

  Ok for adding the XNATOMIC test, because it improves the robustness, but
  why change the way XNATOMIC is set and cleared? The chances of breaking
  things while changing code in this area are really high...

  The current code is (most probably) broken as it does not properly
  synchronize the gatekeeper against a signaled and runaway target
  Linux task.

  We need an indication of whether a Linux signal will wake up (or has
  already woken up) the to-be-migrated task. That task may have continued
  running over its Linux context, potentially on a different CPU. Providing
  this indication is the purpose of changing where XNATOMIC is cleared.
 
 What about synchronizing with the gatekeeper using a semaphore, as done
 in the first patch you sent, but doing it in xnshadow_harden, as soon as
 we detect that we are not back from schedule() in primary mode? It seems
 this would avoid any further issue, as we would then be guaranteed that
 the thread could not switch to TASK_INTERRUPTIBLE again before the
 gatekeeper is finished.

The problem is that the gatekeeper tests the task state without holding
the task's rq lock (which is not available to us without a kernel
patch). That cannot work reliably as long as we accept signals. That's
why I'm trying to move state change and test under nklock.
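
In other words, the gatekeeper side would do roughly this (only a sketch
of the direction, not the actual patch; 'target' stands for the thread
the gatekeeper was asked to harden):

	int resumed = 0;
	spl_t s;

	xnlock_get_irqsave(&nklock, s);
	if (xnthread_test_info(target, XNATOMIC)) {
		/*
		 * XNATOMIC is still set, i.e. no Linux signal hit the
		 * task since it went to sleep in xnshadow_harden(), and
		 * the signal path clears the bit under nklock as well.
		 * Migrating the thread is therefore still safe.
		 */
		xnpod_resume_thread(target, XNRELAX);
		resumed = 1;
	}
	xnlock_put_irqrestore(&nklock, s);

	if (resumed)
		xnpod_schedule();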

 
 What worries me is the comment in xnshadow_harden:
 
* gatekeeper sent us to primary mode. Since
* TASK_UNINTERRUPTIBLE is unavailable to us without wrecking
* the runqueue's count of uniniterruptible tasks, we just
* notice the issue and gracefully fail; the caller will have
* to process this signal anyway.
*/
 
 Does this mean that we can not switch to TASK_UNINTERRUPTIBLE at this
 point? Or simply that TASK_UNINTERRUPTIBLE is not available for the
 business of xnshadow_harden?
 

TASK_UNINTERRUPTIBLE is not available without patching the kernel's
scheduler for the reason mentioned in the comment (the scheduler becomes
confused and may pick the wrong tasks, IIRC).

But I would refrain from trying to improve the gatekeeper design. I've
recently mentioned this to Philippe offlist: For Xenomai 3 with some
ipipe v3, we must rather patch schedule() to enable zero-switch domain
migration. Means: enter the scheduler, let it suspend current and pick
another task, but then simply escalate to the RT domain before doing any
context switch. That's much cheaper than the current design and
hopefully also less error-prone.
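
Very roughly, hand-waving over all the details (the helper names below
are purely illustrative, nothing of this exists yet):

	/*
	 * Inside a patched schedule(): 'prev' has been deactivated and
	 * 'next' has been picked, but no context switch happened yet.
	 * xnshadow_wants_hardening() is a made-up predicate.
	 */
	if (xnshadow_wants_hardening(prev)) {
		/*
		 * Do not switch to 'next'. Escalate to the RT domain
		 * instead; the nucleus then resumes 'prev' as a
		 * primary-mode thread, so the migration costs zero
		 * Linux context switches.
		 */
		xnarch_escalate();	/* or whatever ipipe v3 provides */
		return;
	}

	context_switch(rq, prev, next);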

Jan





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-13 Thread Gilles Chanteperdrix
On 07/13/2011 09:04 PM, Jan Kiszka wrote:
 On 2011-07-13 20:39, Gilles Chanteperdrix wrote:
 On 07/12/2011 07:43 PM, Jan Kiszka wrote:
 On 2011-07-12 19:38, Gilles Chanteperdrix wrote:
 On 07/12/2011 07:34 PM, Jan Kiszka wrote:
 On 2011-07-12 19:31, Gilles Chanteperdrix wrote:
 On 07/12/2011 02:57 PM, Jan Kiszka wrote:
  xnlock_put_irqrestore(&nklock, s);
 xnpod_schedule();
 }
 @@ -1036,6 +1043,7 @@ redo:
  * to process this signal anyway.
  */
 if (rthal_current_domain == rthal_root_domain) {
  +	XENO_BUGON(NUCLEUS, xnthread_test_info(thread, XNATOMIC));

 Misleading dead code again, XNATOMIC is cleared not ten lines above.

 Nope, I forgot to remove that line.


  if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
      || this_task->state != TASK_RUNNING))
 xnpod_fatal
 @@ -1044,6 +1052,8 @@ redo:
 return -ERESTARTSYS;
 }
  
 +   xnthread_clear_info(thread, XNATOMIC);

  Why this? I find the xnthread_clear_info(XNATOMIC) to be in exactly the
  right place where it currently is.

 Nope. Now we either clear XNATOMIC after successful migration or when
 the signal is about to be sent (ie. in the hook). That way we can test
 more reliably (TM) in the gatekeeper if the thread can be migrated.

  Ok for adding the XNATOMIC test, because it improves the robustness, but
  why change the way XNATOMIC is set and cleared? The chances of breaking
  things while changing code in this area are really high...

  The current code is (most probably) broken as it does not properly
  synchronize the gatekeeper against a signaled and runaway target
  Linux task.

  We need an indication of whether a Linux signal will wake up (or has
  already woken up) the to-be-migrated task. That task may have continued
  running over its Linux context, potentially on a different CPU. Providing
  this indication is the purpose of changing where XNATOMIC is cleared.

 What about synchronizing with the gatekeeper using a semaphore, as done
 in the first patch you sent, but doing it in xnshadow_harden, as soon as
 we detect that we are not back from schedule() in primary mode? It seems
 this would avoid any further issue, as we would then be guaranteed that
 the thread could not switch to TASK_INTERRUPTIBLE again before the
 gatekeeper is finished.
 
 The problem is that the gatekeeper tests the task state without holding
 the task's rq lock (which is not available to us without a kernel
 patch). That cannot work reliably as long as we accept signals. That's
 why I'm trying to move state change and test under nklock.
 

 What worries me is the comment in xnshadow_harden:

   * gatekeeper sent us to primary mode. Since
   * TASK_UNINTERRUPTIBLE is unavailable to us without wrecking
   * the runqueue's count of uniniterruptible tasks, we just
   * notice the issue and gracefully fail; the caller will have
   * to process this signal anyway.
   */

 Does this mean that we can not switch to TASK_UNINTERRUPTIBLE at this
 point? Or simply that TASK_UNINTERRUPTIBLE is not available for the
 business of xnshadow_harden?

 
 TASK_UNINTERRUPTIBLE is not available without patching the kernel's
 scheduler for the reason mentioned in the comment (the scheduler becomes
 confused and may pick the wrong tasks, IIRC).

Doesn't using down/up in the taskexit event handler risk causing the
same issue?

 
 But I would refrain from trying to improve the gatekeeper design. I've
 recently mentioned this to Philippe offlist: For Xenomai 3 with some
 ipipe v3, we must rather patch schedule() to enable zero-switch domain
 migration. Means: enter the scheduler, let it suspend current and pick
 another task, but then simply escalate to the RT domain before doing any
 context switch. That's much cheaper than the current design and
 hopefully also less error-prone.

So, do you want me to merge your for-upstream branch?
-- 
Gilles.



Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-13 Thread Philippe Gerum
On Wed, 2011-07-13 at 20:39 +0200, Gilles Chanteperdrix wrote:
 On 07/12/2011 07:43 PM, Jan Kiszka wrote:
  On 2011-07-12 19:38, Gilles Chanteperdrix wrote:
  On 07/12/2011 07:34 PM, Jan Kiszka wrote:
  On 2011-07-12 19:31, Gilles Chanteperdrix wrote:
  On 07/12/2011 02:57 PM, Jan Kiszka wrote:
   xnlock_put_irqrestore(&nklock, s);
  xnpod_schedule();
  }
  @@ -1036,6 +1043,7 @@ redo:
   * to process this signal anyway.
   */
  if (rthal_current_domain == rthal_root_domain) {
   +	XENO_BUGON(NUCLEUS, xnthread_test_info(thread, XNATOMIC));
 
  Misleading dead code again, XNATOMIC is cleared not ten lines above.
 
  Nope, I forgot to remove that line.
 
 
   if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
       || this_task->state != TASK_RUNNING))
  xnpod_fatal
  @@ -1044,6 +1052,8 @@ redo:
  return -ERESTARTSYS;
  }
   
  +   xnthread_clear_info(thread, XNATOMIC);
 
   Why this? I find the xnthread_clear_info(XNATOMIC) to be in exactly the
   right place where it currently is.
 
  Nope. Now we either clear XNATOMIC after successful migration or when
  the signal is about to be sent (ie. in the hook). That way we can test
  more reliably (TM) in the gatekeeper if the thread can be migrated.
 
   Ok for adding the XNATOMIC test, because it improves the robustness, but
   why change the way XNATOMIC is set and cleared? The chances of breaking
   things while changing code in this area are really high...
  
   The current code is (most probably) broken as it does not properly
   synchronize the gatekeeper against a signaled and runaway target
   Linux task.
  
   We need an indication of whether a Linux signal will wake up (or has
   already woken up) the to-be-migrated task. That task may have continued
   running over its Linux context, potentially on a different CPU. Providing
   this indication is the purpose of changing where XNATOMIC is cleared.
 
  What about synchronizing with the gatekeeper using a semaphore, as done
  in the first patch you sent, but doing it in xnshadow_harden, as soon as
  we detect that we are not back from schedule() in primary mode? It seems
  this would avoid any further issue, as we would then be guaranteed that
  the thread could not switch to TASK_INTERRUPTIBLE again before the
  gatekeeper is finished.
 
 What worries me is the comment in xnshadow_harden:
 
* gatekeeper sent us to primary mode. Since
* TASK_UNINTERRUPTIBLE is unavailable to us without wrecking
* the runqueue's count of uniniterruptible tasks, we just
* notice the issue and gracefully fail; the caller will have
* to process this signal anyway.
*/
 
 Does this mean that we can not switch to TASK_UNINTERRUPTIBLE at this
 point? Or simply that TASK_UNINTERRUPTIBLE is not available for the
 business of xnshadow_harden?

Second interpretation is correct.

 

-- 
Philippe.


