[Xenomai-core] [PATCH 3/4] Uninline heavy locking functions

2008-02-23 Thread Jan Kiszka
At least when SMP is enabled, __xnlock_get is already far too
heavyweight to be inlined. xnlock_put is fine now, but a closer look
at the disassembly still revealed a lot of redundancy related to
acquiring and releasing xnlocks. In fact, we mostly use
xnlock_get_irqsave and xnlock_put_irqrestore. Both involve fiddling
with rthal_local_irq_save/restore, which is also heavyweight on SMP.

So this patch turns the latter two into uninlined functions, which
reduces the text size of the nucleus and the skins significantly on
x86-64/SMP (XENO_OPT_DEBUG_NUCLEUS disabled):

Without any patch of this series:
   text    data     bss     dec     hex filename
  79189    2168     308   81665   13f01 kernel/xenomai/skins/native/xeno_native.o
  26668    2176    1104   29948    74fc kernel/xenomai/skins/rtdm/xeno_rtdm.o
 102661    1376  160224  264261   40845 kernel/xenomai/skins/posix/xeno_posix.o
 112482    5440  444340  562262   89456 kernel/xenomai/nucleus/xeno_nucleus.o

With 1+2 applied:
   text    data     bss     dec     hex filename
  76099    2168     308   78575   132ef kernel/xenomai/skins/native/xeno_native.o
  25871    2176    1104   29151    71df kernel/xenomai/skins/rtdm/xeno_rtdm.o
  97816    1376  160224  259416   3f558 kernel/xenomai/skins/posix/xeno_posix.o
 108818    5440  444340  558598   88606 kernel/xenomai/nucleus/xeno_nucleus.o

With this one applied:
   text    data     bss     dec     hex filename
  49469    2168     308   51945    cae9 kernel/xenomai/skins/native/xeno_native.o
  19247    2176    1104   22527    57ff kernel/xenomai/skins/rtdm/xeno_rtdm.o
  60200    1376  160224  221800   36268 kernel/xenomai/skins/posix/xeno_posix.o
  79453    5440  444340  529233   81351 kernel/xenomai/nucleus/xeno_nucleus.o

Given this dramatic reduction, I really cannot imagine any negative
impact on worst-case latencies. Already the first nesting of nklock in
some hot path (which I would call the minimum in practice) should pay
off the additional function call. But I wasn't able to test this yet.
Anyone is welcome to try it out, feedback highly appreciated!
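To make the flag encoding in the uninlined helpers easier to follow, here is a standalone model of the scheme (all names are illustrative, not the real Xenomai/I-pipe API): bit 0 carries the saved interrupt state, bit 1 records that the lock was already held by this CPU, so only the outermost put releases it.

```c
/* Standalone model of the flags encoding used by the uninlined helpers:
 * bit 0 = saved interrupt state, bit 1 = lock was already held, so the
 * matching put must skip the release. Illustrative names only. */

static int lock_owned;       /* 1 if "this CPU" already owns the lock */
static int irqs_enabled = 1; /* current (virtual) interrupt state */

static unsigned long model_irq_save(void)
{
	unsigned long state = irqs_enabled; /* bit 0: previous irq state */
	irqs_enabled = 0;
	return state;
}

unsigned long model_get_irqsave(void)
{
	unsigned long flags = model_irq_save();

	if (lock_owned)		/* recursive entry: do not re-acquire */
		flags |= 2;	/* bit 1: tell put to skip the release */
	else
		lock_owned = 1;

	return flags;
}

void model_put_irqrestore(unsigned long flags)
{
	if (!(flags & 2))	/* only the outermost caller releases */
		lock_owned = 0;

	irqs_enabled = flags & 1; /* restore the saved interrupt state */
}
```

Nesting two get/put pairs shows why this works: the inner pair sees bit 1 set and leaves both the lock and the interrupt state to the outer pair.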

Jan


PS: Philippe, the disassembly of e.g. ipipe_restore_pipeline_head looks
awful for an inline function. That's most obvious on SMP, but it is not
much better on UP. The reason seems to be

ipipe_cpudom_var(__ipipe_pipeline_head(), status)

with its indirections, required to find the head domain slot. Wild idea:
why not use a fixed slot for the head domain? There can only be one
anyway. That would then also help UP configs, where this patch doesn't
improve anything.
---
 include/asm-generic/bits/pod.h |   24 
 include/asm-generic/system.h   |   22 ++
 2 files changed, 26 insertions(+), 20 deletions(-)

Index: b/include/asm-generic/bits/pod.h
===================================================================
--- a/include/asm-generic/bits/pod.h
+++ b/include/asm-generic/bits/pod.h
@@ -295,4 +295,28 @@ unsigned long long xnarch_get_cpu_time(v
 
 EXPORT_SYMBOL(xnarch_get_cpu_time);
 
+#ifdef CONFIG_SMP
+spl_t __xnlock_get_irqsave(xnlock_t *lock  XNLOCK_DBG_CONTEXT_ARGS)
+{
+	unsigned long flags;
+
+	rthal_local_irq_save(flags);
+
+	if (__xnlock_get(lock  XNLOCK_DBG_PASS_CONTEXT))
+		flags |= 2;
+
+	return flags;
+}
+EXPORT_SYMBOL(__xnlock_get_irqsave);
+
+void xnlock_put_irqrestore(xnlock_t *lock, spl_t flags)
+{
+	if (!(flags & 2))
+		xnlock_put(lock);
+
+	rthal_local_irq_restore(flags & 1);
+}
+EXPORT_SYMBOL(xnlock_put_irqrestore);
+#endif /* CONFIG_SMP */
+
 #endif /* !_XENO_ASM_GENERIC_BITS_POD_H */
Index: b/include/asm-generic/system.h
===================================================================
--- a/include/asm-generic/system.h
+++ b/include/asm-generic/system.h
@@ -332,26 +332,8 @@ static inline void xnlock_put(xnlock_t *
 	atomic_set(&lock->owner, ~0);
 }
 
-static inline spl_t
-__xnlock_get_irqsave(xnlock_t *lock  XNLOCK_DBG_CONTEXT_ARGS)
-{
-	unsigned long flags;
-
-	rthal_local_irq_save(flags);
-
-	if (__xnlock_get(lock  XNLOCK_DBG_PASS_CONTEXT))
-		flags |= 2;
-
-	return flags;
-}
-
-static inline void xnlock_put_irqrestore(xnlock_t *lock, spl_t flags)
-{
-	if (!(flags & 2))
-		xnlock_put(lock);
-
-	rthal_local_irq_restore(flags & 1);
-}
+spl_t __xnlock_get_irqsave(xnlock_t *lock  XNLOCK_DBG_CONTEXT_ARGS);
+void xnlock_put_irqrestore(xnlock_t *lock, spl_t flags);
 
 static inline int xnarch_send_ipi(xnarch_cpumask_t cpumask)
 {
___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core


Re: [Xenomai-core] [PATCH 3/4] Uninline heavy locking functions

2008-02-23 Thread Jan Kiszka
Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> At least when SMP is enabled, __xnlock_get is already far too
>> heavyweight to be inlined. xnlock_put is fine now, but a closer look
>> at the disassembly still revealed a lot of redundancy related to
>> acquiring and releasing xnlocks. In fact, we mostly use
>> xnlock_get_irqsave and xnlock_put_irqrestore. Both involve fiddling
>> with rthal_local_irq_save/restore, which is also heavyweight on SMP.
>> 
>> So this patch turns the latter two into uninlined functions, which
>> reduces the text size of the nucleus and the skins significantly on
>> x86-64/SMP (XENO_OPT_DEBUG_NUCLEUS disabled):
> 
> I think the human idea of how long an inline function can be is far more
> restrictive than what a processor can take. When looking at assembly
> code, you always find the code long, whereas in reality it is not that
> long for a processor.
> 
> Besides, IMO, the proper way to uninline xnlock operations is to leave
> the non-contended case inline, and to move the spinning out of line.

This patch is not just about uninlining xnlock, that's only one half of
the savings. The other one is irq disabling via the I-pipe. The problem
in our case is that there is no single simple check to determine that we
are on the fast path. Rather, we have to do quite some
calculations/lookups before the first check, and we have to perform
multiple checks even in the best case.

> 
> And this is something we should not do without measuring its impact.

For sure, this will be done. But I'm very optimistic about the results,
given this massive code size reduction - which should translate into
fewer cache misses on the worst-case path. What increases latency most
for us (special hardware properties aside) is memory access, both data
and text.

Jan






Re: [Xenomai-core] [PATCH 3/4] Uninline heavy locking functions

2008-02-23 Thread Gilles Chanteperdrix
Jan Kiszka wrote:
> At least when SMP is enabled, __xnlock_get is already far too
> heavyweight to be inlined. xnlock_put is fine now, but a closer look
> at the disassembly still revealed a lot of redundancy related to
> acquiring and releasing xnlocks. In fact, we mostly use
> xnlock_get_irqsave and xnlock_put_irqrestore. Both involve fiddling
> with rthal_local_irq_save/restore, which is also heavyweight on SMP.
> 
> So this patch turns the latter two into uninlined functions, which
> reduces the text size of the nucleus and the skins significantly on
> x86-64/SMP (XENO_OPT_DEBUG_NUCLEUS disabled):

I think the human idea of how long an inline function can be is far more
restrictive than what a processor can take. When looking at assembly
code, you always find the code long, whereas in reality it is not that
long for a processor. 

Besides, IMO, the proper way to uninline xnlock operations is to leave
the non-contended case inline, and to move the spinning out of line.

And this is something we should not do without measuring its impact.
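The split suggested above can be sketched like this (illustrative names, not the actual Xenomai API): the uncontended acquisition stays inline, so the hot path costs a single cmpxchg, while the spin loop is compiled out of line so only one copy of it exists.

```c
#include <stdatomic.h>

/* Sketch of "fast path inline, spinning out of line". The owner word
 * holds a CPU id, or -1 when the lock is free (mirroring xnlock's
 * owner = ~0 convention). Illustrative names only. */

void __lock_slow(atomic_int *owner, int me); /* lives in a .c file */

static inline void lock_fast(atomic_int *owner, int me)
{
	int expected = -1; /* -1 encodes "unowned" */

	/* Uncontended case: one atomic cmpxchg, fully inline. */
	if (atomic_compare_exchange_strong(owner, &expected, me))
		return;

	__lock_slow(owner, me); /* contended: spin out of line */
}

/* Out of line, so the spin loop is not duplicated at every call site. */
void __lock_slow(atomic_int *owner, int me)
{
	int expected;

	do {
		expected = -1;
	} while (!atomic_compare_exchange_weak(owner, &expected, me));
}

static inline void unlock(atomic_int *owner)
{
	atomic_store(owner, -1); /* back to "unowned" */
}
```

Note this sketch omits the recursion check and the irq save/restore discussed in the thread, which is exactly the part Jan argues cannot be reduced to a single inline test.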

-- 


Gilles Chanteperdrix.



Re: [Xenomai-core] [PATCH 3/4] Uninline heavy locking functions

2008-02-23 Thread Jan Kiszka
Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>> Jan Kiszka wrote:
>>>> At least when SMP is enabled, __xnlock_get is already far too
>>>> heavyweight to be inlined. xnlock_put is fine now, but a closer look
>>>> at the disassembly still revealed a lot of redundancy related to
>>>> acquiring and releasing xnlocks. In fact, we mostly use
>>>> xnlock_get_irqsave and xnlock_put_irqrestore. Both involve fiddling
>>>> with rthal_local_irq_save/restore, which is also heavyweight on SMP.
>>>> 
>>>> So this patch turns the latter two into uninlined functions, which
>>>> reduces the text size of the nucleus and the skins significantly on
>>>> x86-64/SMP (XENO_OPT_DEBUG_NUCLEUS disabled):
>>> 
>>> I think the human idea of how long an inline function can be is far
>>> more restrictive than what a processor can take. When looking at
>>> assembly code, you always find the code long, whereas in reality it
>>> is not that long for a processor.
>>> 
>>> Besides, IMO, the proper way to uninline xnlock operations is to
>>> leave the non-contended case inline, and to move the spinning out of
>>> line.
>> 
>> This patch is not just about uninlining xnlock, that's only one half
>> of the savings. The other one is irq disabling via the I-pipe. The
>> problem in our case is that there is no single simple check to
>> determine that we are on the fast path. Rather, we have to do quite
>> some calculations/lookups before the first check, and we have to
>> perform multiple checks even in the best case.
> 
> This is my fault, a tradeoff I made: I thought that the atomic_cmpxchg
> could be heavy on SMP systems, so I made a first check to see if we are
> not recursing. But we can do the two operations in one move if we
> accept a failing atomic_cmpxchg when recursing.

I'm unsure about the cache pressure of cmpxchg vs. a plain read. I guess
the existing variant is already better.

Moreover, the spinning code is only a fraction of the fraction. We
cannot eliminate the recursion check, and we still have all the
local_irq_save code. And _all_ this code mostly comes together, which is
why we save so much by uninlining those two functions.

Jan


