[Xenomai-core] [PATCH 3/4] Uninline heavy locking functions
At least when SMP is enabled, __xnlock_get already becomes far too heavy-weight to be inlined. xnlock_put is fine now, but looking closer at the disassembly still revealed a lot of redundancy related to acquiring and releasing xnlocks. In fact, we are mostly using xnlock_get_irqsave and xnlock_put_irqrestore. Both include fiddling with rthal_local_irq_save/restore, which is also heavy-weight on SMP. So this patch turns the latter two into uninlined functions, which reduces the text size of nucleus and skins significantly on x86-64/SMP (XENO_OPT_DEBUG_NUCLEUS disabled):

Without any patch of this series:

   text    data     bss     dec     hex filename
  79189    2168     308   81665   13f01 kernel/xenomai/skins/native/xeno_native.o
  26668    2176    1104   29948    74fc kernel/xenomai/skins/rtdm/xeno_rtdm.o
 102661    1376  160224  264261   40845 kernel/xenomai/skins/posix/xeno_posix.o
 112482    5440  444340  562262   89456 kernel/xenomai/nucleus/xeno_nucleus.o

With 1+2 applied:

   text    data     bss     dec     hex filename
  76099    2168     308   78575   132ef kernel/xenomai/skins/native/xeno_native.o
  25871    2176    1104   29151    71df kernel/xenomai/skins/rtdm/xeno_rtdm.o
  97816    1376  160224  259416   3f558 kernel/xenomai/skins/posix/xeno_posix.o
 108818    5440  444340  558598   88606 kernel/xenomai/nucleus/xeno_nucleus.o

With this one applied:

   text    data     bss     dec     hex filename
  49469    2168     308   51945    cae9 kernel/xenomai/skins/native/xeno_native.o
  19247    2176    1104   22527    57ff kernel/xenomai/skins/rtdm/xeno_rtdm.o
  60200    1376  160224  221800   36268 kernel/xenomai/skins/posix/xeno_posix.o
  79453    5440  444340  529233   81351 kernel/xenomai/nucleus/xeno_nucleus.o

Given this dramatic reduction, I really cannot imagine any negative impact on worst-case latencies. Already the first nesting of nklock in some hot path (I would say this is the minimum in practice) should pay off the additional function call. But I wasn't able to test yet. Anyone is welcome to try this out, feedback highly appreciated!

Jan

PS: Philippe, ipipe_restore_pipeline_head's disassembly, e.g., looks awful for an inline function. That's most obvious on SMP, but it is not much better on UP. The reason seems to be ipipe_cpudom_var(__ipipe_pipeline_head(), status) with its indirections, required to find the head domain slot. Wild idea: why not use a fixed slot for the head domain? There can only be one anyway. That would then also help UP configs where this patch doesn't improve anything.
---
 include/asm-generic/bits/pod.h |   24 ++++++++++++++++++++++++
 include/asm-generic/system.h   |   22 ++--------------------
 2 files changed, 26 insertions(+), 20 deletions(-)

Index: b/include/asm-generic/bits/pod.h
===================================================================
--- a/include/asm-generic/bits/pod.h
+++ b/include/asm-generic/bits/pod.h
@@ -295,4 +295,28 @@ unsigned long long xnarch_get_cpu_time(v
 EXPORT_SYMBOL(xnarch_get_cpu_time);
 
+#ifdef CONFIG_SMP
+spl_t __xnlock_get_irqsave(xnlock_t *lock XNLOCK_DBG_CONTEXT_ARGS)
+{
+	unsigned long flags;
+
+	rthal_local_irq_save(flags);
+
+	if (__xnlock_get(lock XNLOCK_DBG_PASS_CONTEXT))
+		flags |= 2;
+
+	return flags;
+}
+EXPORT_SYMBOL(__xnlock_get_irqsave);
+
+void xnlock_put_irqrestore(xnlock_t *lock, spl_t flags)
+{
+	if (!(flags & 2))
+		xnlock_put(lock);
+
+	rthal_local_irq_restore(flags & 1);
+}
+EXPORT_SYMBOL(xnlock_put_irqrestore);
+#endif /* CONFIG_SMP */
+
 #endif /* !_XENO_ASM_GENERIC_BITS_POD_H */
Index: b/include/asm-generic/system.h
===================================================================
--- a/include/asm-generic/system.h
+++ b/include/asm-generic/system.h
@@ -332,26 +332,8 @@ static inline void xnlock_put(xnlock_t *
 	atomic_set(&lock->owner, ~0);
 }
 
-static inline spl_t
-__xnlock_get_irqsave(xnlock_t *lock XNLOCK_DBG_CONTEXT_ARGS)
-{
-	unsigned long flags;
-
-	rthal_local_irq_save(flags);
-
-	if (__xnlock_get(lock XNLOCK_DBG_PASS_CONTEXT))
-		flags |= 2;
-
-	return flags;
-}
-
-static inline void xnlock_put_irqrestore(xnlock_t *lock, spl_t flags)
-{
-	if (!(flags & 2))
-		xnlock_put(lock);
-
-	rthal_local_irq_restore(flags & 1);
-}
+spl_t __xnlock_get_irqsave(xnlock_t *lock XNLOCK_DBG_CONTEXT_ARGS);
+void xnlock_put_irqrestore(xnlock_t *lock, spl_t flags);
 
 static inline int xnarch_send_ipi(xnarch_cpumask_t cpumask)
 {
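For context, callers keep going through the usual xnlock_get_irqsave()/xnlock_put_irqrestore() wrappers, so call sites are unchanged by the patch. A minimal usage sketch, assuming the standard nklock; the surrounding function is made up for illustration:

	static void example_critical_section(void)
	{
		spl_t s;

		/* expands to s = __xnlock_get_irqsave(&nklock ...), now out of line */
		xnlock_get_irqsave(&nklock, s);

		/* ... touch nucleus state with hard IRQs off and nklock held ... */

		xnlock_put_irqrestore(&nklock, s);
	}

The returned flags word encodes two pieces of state: bit 0 carries the hardware interrupt state saved by rthal_local_irq_save(), and bit 1 (value 2) is set when __xnlock_get() reports that this CPU already holds the lock, so xnlock_put_irqrestore() skips the actual release for nested acquisitions.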
Re: [Xenomai-core] [PATCH 3/4] Uninline heavy locking functions
Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> At least when SMP is enabled, __xnlock_get already becomes far too
>> heavy-weight to be inlined. xnlock_put is fine now, but looking closer
>> at the disassembly still revealed a lot of redundancy related to
>> acquiring and releasing xnlocks. In fact, we are mostly using
>> xnlock_get_irqsave and xnlock_put_irqrestore. Both include fiddling
>> with rthal_local_irq_save/restore, which is also heavy-weight on SMP.
>> So this patch turns the latter two into uninlined functions, which
>> reduces the text size of nucleus and skins significantly on x86-64/SMP
>> (XENO_OPT_DEBUG_NUCLEUS disabled):
>
> I think the human idea of how long an inline function can be is far
> more restrictive than what a processor can take. When looking at
> assembly code, you always find the code long, whereas in reality it is
> not that long for a processor.
>
> Besides, IMO, the proper way to uninline xnlock operations is to leave
> the non-contended case inline, and to move the spinning out of line.

This patch is not just about uninlining xnlock, that's only one half of the savings. The other one is the IRQ disabling via I-pipe. The problem in our case is that we have no simple single check to find out that we are on the fast path. Rather, we have to do quite some calculations/lookups before the first check, and we have to perform multiple checks even in the best case.

> And this is something we should not do without measuring its impact.

For sure, that will be done. But I'm very optimistic about the results given this massive code size reduction, which should translate into fewer cache misses on the worst-case path. What increases latency most for us (special hardware properties aside) is memory access, both data and text.

Jan
Re: [Xenomai-core] [PATCH 3/4] Uninline heavy locking functions
Jan Kiszka wrote:
> At least when SMP is enabled, __xnlock_get already becomes far too
> heavy-weight to be inlined. xnlock_put is fine now, but looking closer
> at the disassembly still revealed a lot of redundancy related to
> acquiring and releasing xnlocks. In fact, we are mostly using
> xnlock_get_irqsave and xnlock_put_irqrestore. Both include fiddling
> with rthal_local_irq_save/restore, which is also heavy-weight on SMP.
> So this patch turns the latter two into uninlined functions, which
> reduces the text size of nucleus and skins significantly on x86-64/SMP
> (XENO_OPT_DEBUG_NUCLEUS disabled):

I think the human idea of how long an inline function can be is far more restrictive than what a processor can take. When looking at assembly code, you always find the code long, whereas in reality it is not that long for a processor.

Besides, IMO, the proper way to uninline xnlock operations is to leave the non-contended case inline, and to move the spinning out of line. And this is something we should not do without measuring its impact.

--
Gilles Chanteperdrix.
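To illustrate the split suggested here, a rough sketch that keeps the uncontended case inline and only branches out of line once the lock is actually contended. The helper name __xnlock_spin is invented for illustration; the owner handling mirrors the existing atomic_t-based xnlock scheme:

	void __xnlock_spin(xnlock_t *lock);	/* out of line: spins until the lock is acquired */

	static inline int __xnlock_get(xnlock_t *lock)
	{
		int cpu = xnarch_current_cpu();

		if (atomic_read(&lock->owner) == cpu)
			return 2;	/* recursive acquisition, nothing to release later */

		if (likely(atomic_cmpxchg(&lock->owner, ~0, cpu) == ~0))
			return 0;	/* uncontended fast path, stays inline */

		__xnlock_spin(lock);	/* contention: take the out-of-line slow path */
		return 0;
	}

That way the per-call-site text cost stays close to a test-and-branch plus the cmpxchg, while the rarely executed spin loop no longer bloats every caller.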
Re: [Xenomai-core] [PATCH 3/4] Uninline heavy locking functions
Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>> Jan Kiszka wrote:
>>>> At least when SMP is enabled, __xnlock_get already becomes far too
>>>> heavy-weight to be inlined. xnlock_put is fine now, but looking
>>>> closer at the disassembly still revealed a lot of redundancy related
>>>> to acquiring and releasing xnlocks. In fact, we are mostly using
>>>> xnlock_get_irqsave and xnlock_put_irqrestore. Both include fiddling
>>>> with rthal_local_irq_save/restore, which is also heavy-weight on
>>>> SMP. So this patch turns the latter two into uninlined functions,
>>>> which reduces the text size of nucleus and skins significantly on
>>>> x86-64/SMP (XENO_OPT_DEBUG_NUCLEUS disabled):
>>>
>>> I think the human idea of how long an inline function can be is far
>>> more restrictive than what a processor can take. When looking at
>>> assembly code, you always find the code long, whereas in reality it
>>> is not that long for a processor.
>>>
>>> Besides, IMO, the proper way to uninline xnlock operations is to
>>> leave the non-contended case inline, and to move the spinning out of
>>> line.
>>
>> This patch is not just about uninlining xnlock, that's only one half
>> of the savings. The other one is the IRQ disabling via I-pipe. The
>> problem in our case is that we have no simple single check to find
>> out that we are on the fast path. Rather, we have to do quite some
>> calculations/lookups before the first check, and we have to perform
>> multiple checks even in the best case.
>
> This is my fault, a tradeoff I made: I thought that the atomic_cmpxchg
> could be heavy on SMP systems, so I made a first check to see if we
> are not recursing. But we can do the two operations in one move if we
> accept having a failing atomic_cmpxchg when recursing.

I'm unsure about the cache pressure of cmpxchg vs. a plain read. I guess the existing variant is already better. Moreover, the spinning code is only a fraction of the fraction. We cannot eliminate the recursion check, and we still have all the local_irq_save code. And _all_ this code mostly comes together, thus we save so much by uninlining those two functions.

Jan
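For completeness, the "one move" variant Gilles describes, issuing the cmpxchg first and only classifying a failure as recursion or contention afterwards, would look roughly like the following. This is a hypothetical sketch, not code from the tree, and the spin loop is deliberately naive:

	static inline int __xnlock_get_cmpxchg_first(xnlock_t *lock)
	{
		int cpu = xnarch_current_cpu();
		int owner;

		for (;;) {
			owner = atomic_cmpxchg(&lock->owner, ~0, cpu);

			if (likely(owner == ~0))
				return 0;	/* acquired in a single atomic operation */

			if (owner == cpu)
				return 2;	/* we were recursing: the cmpxchg failed, but harmlessly */

			/* contended: wait for the current owner to release, then retry */
			while (atomic_read(&lock->owner) != ~0)
				cpu_relax();
		}
	}

The tradeoff under discussion is visible here: the recursive case now pays for a failed cmpxchg instead of a plain read, in exchange for a single atomic operation on the uncontended path.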