Re: [PATCH 1/3] [ARM] Translate delay.S into (mostly) C
On 10/06/2010 01:05 PM, Nicolas Pitre wrote: You could use the noinline qualifier from linux/compiler.h with those functions you don't want inlined. That won't help me for the interleaving behavior though. Is it possible to do all this in assembly ? Can't you have the default implementation using this assembly with different function names, then just set the assembly function names in C code someplace? That weould be my preference too. Being in assembly means that this code is unlikely to change with different optimization levels and/or gcc versions which would otherwise require different calibration values. Relying on stable calibration is necessary for the lpj kernel cmdline parameter to have some meaning. Why doesn't any other architecture use assembly for their lpj code? They may use headers with assembly in them or C code with assembly in them, but they don't write all of the delay code in assembly and rely on function interleaving. This leads me to believe other arches aren't concerned about compiler optimizations breaking lpj cmdline parameters, so why should ARM be concerned? I tested the theory out and scaled down the CPU frequency to 19.2 MHz and then called calibrate_delay(). Before and after applying this series I got the same results. Calibrating delay loop... 12.67 BogoMIPS (lpj=63360) Jumping up to 1.2 GHz and calling calibrate_delay() gives me the same before and after Calibrating delay loop... 792.98 BogoMIPS (lpj=3964928) I don't have access to a machine capable of running slower than 19.2 MHz. Maybe machines running in the KHz range would experience differences? -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. -- To unsubscribe from this list: send the line unsubscribe linux-arm-msm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] [ARM] Translate delay.S into (mostly) C
On Thu, 7 Oct 2010, Stephen Boyd wrote: Why doesn't any other architecture use assembly for their lpj code? They may use headers with assembly in them or C code with assembly in them, but they don't write all of the delay code in assembly and rely on function interleaving. This leads me to believe other arches aren't concerned about compiler optimizations breaking lpj cmdline parameters, so why should ARM be concerned? I tested the theory out and scaled down the CPU frequency to 19.2 MHz and then called calibrate_delay(). Before and after applying this series I got the same results. Calibrating delay loop... 12.67 BogoMIPS (lpj=63360) Jumping up to 1.2 GHz and calling calibrate_delay() gives me the same before and after Calibrating delay loop... 792.98 BogoMIPS (lpj=3964928) OK, fair enough. I don't have access to a machine capable of running slower than 19.2 MHz. Maybe machines running in the KHz range would experience differences? Don't worry, I doubt any ARM processor capable of running Linux ever was that slow. Nicolas -- To unsubscribe from this list: send the line unsubscribe linux-arm-msm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] [ARM] Translate delay.S into (mostly) C
On Tue, 2010-10-05 at 20:36 -0700, Stephen Boyd wrote: On 10/05/2010 10:22 AM, Daniel Walker wrote: You shouldn't really reference this series in this comment. You have to look at this as a changeset comment. So you really want to describe what this change is doing not what the over all series is doing. It would be better to say something like, We're implementing this in C to give us further flexibility in allowing overrides. But don't references next patch or this series , but you can do that in the intro email. Why not? This is a common thing to do when multiple patches are related to one topic. Doing a git log and searching for next patch shows others doing the same. I would not say it's common .. I don't recall seeing many other series which do this. You shouldn't do it because the patch description doesn't stand on it own. What if this patch is accepted in isolation, and the others rejected would your description make sense? Also people don't always look at a series in git history, sometimes you only look at one commit so saying next patch or this series can be confusing. You can have an intro email where you can describe the whole series, including what your intending for each patch to do. $ scripts/bloat-o-meter vmlinux.orig vmlinux.new add/remove: 0/0 grow/shrink: 2/0 up/down: 12/0 (12) function old new delta __udelay 48 56 +8 __const_udelay40 44 +4 I think the size command might be better for this type of stuff. It should give you the same output (or similar).. It's just used more frequently. Ok. $ size vmlinux.orig textdata bss dec hex filename 6533503 530232 1228296 8292031 7e86bf vmlinux.orig $ size vmlinux.new textdata bss dec hex filename 6533607 530232 1228296 8292135 7e8727 vmlinux.new I should mention GCC decided to inline __delay() into __udelay() and __const_udelay() and also decided to inline __const_udelay() into __delay() meaning we lost the function interleaving the assembly file had. I'll make a note of that and sorry for being misleading. Is this an -Os kernel, or -O2 ? -Os. +/* + * 0 = xloops = 0x7f06 + * loops_per_jiffy = 0x01ff (max. 3355 bogomips) As long as your doing a re-write may as well add some real text here, and for the others. Any suggestions on what to add? I hoped the original comments would be good enough already considering the code is unchanged. you might want to say what the purpose of the function is .. + */ +void __const_udelay(unsigned long xloops) +{ + unsigned long lpj; + unsigned long loops; + + xloops = 14; /* max = 0x01ff */ + lpj = loops_per_jiffy 10;/* max = 0x0001 */ + loops = lpj * xloops; /* max = 0x7fff */ + loops = 6;/* max = 2^32-1 */ + + if (loops) + __delay(loops); likely/unlikely ? likely. Although I'm doubtful on how much it will help considering the assembly and the code are equivalent already for this part of the code. I don't know, shouldn't hurt tho. Daniel -- Sent by an consultant of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. -- To unsubscribe from this list: send the line unsubscribe linux-arm-msm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] [ARM] Translate delay.S into (mostly) C
On Tue, 5 Oct 2010, Stephen Boyd wrote: On 10/05/2010 10:22 AM, Daniel Walker wrote: I think the size command might be better for this type of stuff. It should give you the same output (or similar).. It's just used more frequently. Ok. $ size vmlinux.orig textdata bss dec hex filename 6533503 530232 1228296 8292031 7e86bf vmlinux.orig $ size vmlinux.new textdata bss dec hex filename 6533607 530232 1228296 8292135 7e8727 vmlinux.new If you modified only one source file then I'd suggest you run 'size' only on the affected .o file instead. Nicolas -- To unsubscribe from this list: send the line unsubscribe linux-arm-msm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] [ARM] Translate delay.S into (mostly) C
On 10/06/2010 07:26 AM, Nicolas Pitre wrote: Ok. $ size vmlinux.orig textdata bss dec hex filename 6533503 530232 1228296 8292031 7e86bf vmlinux.orig $ size vmlinux.new textdata bss dec hex filename 6533607 530232 1228296 8292135 7e8727 vmlinux.new If you modified only one source file then I'd suggest you run 'size' only on the affected .o file instead. Ok. $ size delay.o.orig textdata bss dec hex filename 56 0 0 56 38 delay.o.orig $ size delay.o.new textdata bss dec hex filename 192 0 0 192 c0 delay.o.new Perhaps I should mark __delay and __const_udelay as noinline? $ size delay.o.noinline textdata bss dec hex filename 144 0 0 144 90 delay.o.noinline Now we get backwards branching. $ objdump -d delay.o.noinline Disassembly of section .text: __delay: 0: e251subsr0, r0, #1 ; 0x1 4: 8afdbhi 0 __delay 8: e12fff1ebx lr 000c __const_udelay: c: e59f3018ldr r3, [pc, #24] ; 2c __const_udelay+0x20 10: e1a00720lsr r0, r0, #14 14: e5933000ldr r3, [r3] 18: e1a03523lsr r3, r3, #10 1c: e093mul r0, r3, r0 20: e1b00320lsrsr0, r0, #6 24: 012fff1ebxeqlr 28: eafeb 0 __delay 2c: .word 0x 0030 __udelay: 30: e59f3004ldr r3, [pc, #4]; 3c __udelay+0xc 34: e093mul r0, r3, r0 38: eafeb c __const_udelay 3c: 0001a36e.word 0x0001a36e Is there some way to force GCC to do what I want (interleave the functions)? It seems happy to inline them and then optimize the register usage and instruction ordering. Perhaps that is OK though and we're wasting our time trying to be conservative in code size. $ objdump -d delay.o.new Disassembly of section .text: __delay: 0: e251subsr0, r0, #1 ; 0x1 4: 8afdbhi 0 __delay 8: e12fff1ebx lr 000c __const_udelay: c: e59f3020ldr r3, [pc, #32] ; 34 __const_udelay+0x28 10: e1a00720lsr r0, r0, #14 14: e5933000ldr r3, [r3] 18: e1a03523lsr r3, r3, #10 1c: e093mul r0, r3, r0 20: e1b00320lsrsr0, r0, #6 24: 012fff1ebxeqlr 28: e251subsr0, r0, #1 ; 0x1 2c: 8afdbhi 28 __const_udelay+0x1c 30: e12fff1ebx lr 34: .word 0x 0038 __udelay: 38: e59f3028ldr r3, [pc, #40] ; 68 __udelay+0x30 3c: e59f2028ldr r2, [pc, #40] ; 6c __udelay+0x34 40: e0030093mul r3, r3, r0 44: e5922000ldr r2, [r2] 48: e1a02522lsr r2, r2, #10 4c: e1a03723lsr r3, r3, #14 50: e0030392mul r3, r2, r3 54: e1b03323lsrsr3, r3, #6 58: 012fff1ebxeqlr 5c: e2533001subsr3, r3, #1 ; 0x1 60: 8afdbhi 5c __udelay+0x24 64: e12fff1ebx lr 68: 0001a36e.word 0x0001a36e 6c: .word 0x -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. -- To unsubscribe from this list: send the line unsubscribe linux-arm-msm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] [ARM] Translate delay.S into (mostly) C
On Wed, 2010-10-06 at 11:30 -0700, Stephen Boyd wrote: Is there some way to force GCC to do what I want (interleave the functions)? It seems happy to inline them and then optimize the register usage and instruction ordering. Perhaps that is OK though and we're wasting our time trying to be conservative in code size. Is it possible to do all this in assembly ? Can't you have the default implementation using this assembly with different function names, then just set the assembly function names in C code someplace? Daniel -- Sent by an consultant of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. -- To unsubscribe from this list: send the line unsubscribe linux-arm-msm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] [ARM] Translate delay.S into (mostly) C
On Wed, 6 Oct 2010, Daniel Walker wrote: On Wed, 2010-10-06 at 11:30 -0700, Stephen Boyd wrote: Is there some way to force GCC to do what I want (interleave the functions)? It seems happy to inline them and then optimize the register usage and instruction ordering. Perhaps that is OK though and we're wasting our time trying to be conservative in code size. You could use the noinline qualifier from linux/compiler.h with those functions you don't want inlined. Is it possible to do all this in assembly ? Can't you have the default implementation using this assembly with different function names, then just set the assembly function names in C code someplace? That weould be my preference too. Being in assembly means that this code is unlikely to change with different optimization levels and/or gcc versions which would otherwise require different calibration values. Relying on stable calibration is necessary for the lpj kernel cmdline parameter to have some meaning. Nicolas -- To unsubscribe from this list: send the line unsubscribe linux-arm-msm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] [ARM] Translate delay.S into (mostly) C
On Mon, 2010-09-27 at 20:33 -0700, Stephen Boyd wrote: In the next patch we're going to allow machines to override the __delay() implementation at runtime so they can implement a timer based __delay() routine. It's easier to do this using C, so lets write udelay and friends in C. You shouldn't really reference this series in this comment. You have to look at this as a changeset comment. So you really want to describe what this change is doing not what the over all series is doing. It would be better to say something like, We're implementing this in C to give us further flexibility in allowing overrides. But don't references next patch or this series , but you can do that in the intro email. We lose the #if 0 code, which according to Russell is used to make the delay loop more stable and predictable on older CPUs (see http://article.gmane.org/gmane.linux.kernel/67 for more info). We shouldn't be too worried though, since the next patch adds functionality to allow a machine to set the __delay() loop themselves, therefore allowing machines to resurrect the commented out code if they need it. bloat-o-meter shows an increase of 12 bytes. Further inspection of the assembly shows GCC copying the loops_per_jiffy pointer and the magic HZ value to the ends of __const_udelay() and _delay() thus contributing an extra 4 and 8 bytes of data to each function. These two values weren't taken into account in the delay.S version since they weren't part of the function in nm's eyes. This means we only really gained an extra 4 bytes due to GCC's decision to duplicate the loops_per_jiffy pointer in __const_udelay. $ scripts/bloat-o-meter vmlinux.orig vmlinux.new add/remove: 0/0 grow/shrink: 2/0 up/down: 12/0 (12) function old new delta __udelay 48 56 +8 __const_udelay40 44 +4 I think the size command might be better for this type of stuff. It should give you the same output (or similar).. It's just used more frequently. Is this an -Os kernel, or -O2 ? Signed-off-by: Stephen Boyd sb...@codeaurora.org Reviewed-by: Saravana Kannan skan...@codeaurora.org --- So maybe we can make the magic HZ value into a #define? UDELAY_SCALAR? UDELAY_MAGIC? arch/arm/include/asm/delay.h |2 +- arch/arm/kernel/armksyms.c |4 -- arch/arm/lib/delay.S | 65 -- arch/arm/lib/delay.c | 57 4 files changed, 58 insertions(+), 70 deletions(-) delete mode 100644 arch/arm/lib/delay.S create mode 100644 arch/arm/lib/delay.c diff --git a/arch/arm/include/asm/delay.h b/arch/arm/include/asm/delay.h index b2deda1..ccc5ed5 100644 --- a/arch/arm/include/asm/delay.h +++ b/arch/arm/include/asm/delay.h @@ -8,7 +8,7 @@ #include asm/param.h /* HZ */ -extern void __delay(int loops); +extern void __delay(unsigned long loops); /* * This function intentionally does not exist; if you see references to diff --git a/arch/arm/kernel/armksyms.c b/arch/arm/kernel/armksyms.c index 8214bfe..259e549 100644 --- a/arch/arm/kernel/armksyms.c +++ b/arch/arm/kernel/armksyms.c @@ -52,10 +52,6 @@ extern void fpundefinstr(void); EXPORT_SYMBOL(__backtrace); - /* platform dependent support */ -EXPORT_SYMBOL(__udelay); -EXPORT_SYMBOL(__const_udelay); - /* networking */ EXPORT_SYMBOL(csum_partial); EXPORT_SYMBOL(csum_partial_copy_from_user); diff --git a/arch/arm/lib/delay.S b/arch/arm/lib/delay.S deleted file mode 100644 index 8d6a876..000 --- a/arch/arm/lib/delay.S +++ /dev/null @@ -1,65 +0,0 @@ -/* - * linux/arch/arm/lib/delay.S - * - * Copyright (C) 1995, 1996 Russell King - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - */ -#include linux/linkage.h -#include asm/assembler.h -#include asm/param.h - .text - -.LC0:.word loops_per_jiffy -.LC1:.word (2199023*HZ)11 - -/* - * r0 = 2000 - * lpj = 0x01ff (max. 3355 bogomips) - * HZ = 1000 - */ - -ENTRY(__udelay) - ldr r2, .LC1 - mul r0, r2, r0 -ENTRY(__const_udelay)@ 0 = r0 = 0x7f06 - ldr r2, .LC0 - ldr r2, [r2]@ max = 0x01ff - mov r0, r0, lsr #14 @ max = 0x0001 - mov r2, r2, lsr #10 @ max = 0x7fff - mul r0, r2, r0 @ max = 2^32-1 - movsr0, r0, lsr #6 - moveq pc, lr - -/* - * loops = r0 * HZ * loops_per_jiffy / 100 - * - * Oh, if only we had a cycle counter... - */ - -@ Delay routine -ENTRY(__delay) -