Re: [PATCH 1/3] [ARM] Translate delay.S into (mostly) C

2010-10-07 Thread Stephen Boyd
 On 10/06/2010 01:05 PM, Nicolas Pitre wrote:
 You could use the noinline qualifier from linux/compiler.h with those 
 functions you don't want inlined.


That won't help me for the interleaving behavior though.

 Is it possible to do all this in assembly ? Can't you have the default
 implementation using this assembly with different function names, then
 just set the assembly function names in C code someplace?
 That weould be my preference too.  Being in assembly means that this 
 code is unlikely to change with different optimization levels and/or gcc 
 versions which would otherwise require different calibration values. 
 Relying on stable calibration is necessary for the lpj kernel cmdline 
 parameter to have some meaning.

Why doesn't any other architecture use assembly for their lpj code? They
may use headers with assembly in them or C code with assembly in them,
but they don't write all of the delay code in assembly and rely on
function interleaving. This leads me to believe other arches aren't
concerned about compiler optimizations breaking lpj cmdline parameters,
so why should ARM be concerned?

I tested the theory out and scaled down the CPU frequency to 19.2 MHz
and then called calibrate_delay(). Before and after applying this series
I got the same results.

Calibrating delay loop... 12.67 BogoMIPS (lpj=63360)

Jumping up to 1.2 GHz and calling calibrate_delay() gives me the same
before and after

Calibrating delay loop... 792.98 BogoMIPS (lpj=3964928)

I don't have access to a machine capable of running slower than 19.2
MHz. Maybe machines running in the KHz range would experience differences?

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

--
To unsubscribe from this list: send the line unsubscribe linux-arm-msm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] [ARM] Translate delay.S into (mostly) C

2010-10-07 Thread Nicolas Pitre
On Thu, 7 Oct 2010, Stephen Boyd wrote:

 Why doesn't any other architecture use assembly for their lpj code? They
 may use headers with assembly in them or C code with assembly in them,
 but they don't write all of the delay code in assembly and rely on
 function interleaving. This leads me to believe other arches aren't
 concerned about compiler optimizations breaking lpj cmdline parameters,
 so why should ARM be concerned?
 
 I tested the theory out and scaled down the CPU frequency to 19.2 MHz
 and then called calibrate_delay(). Before and after applying this series
 I got the same results.
 
 Calibrating delay loop... 12.67 BogoMIPS (lpj=63360)
 
 Jumping up to 1.2 GHz and calling calibrate_delay() gives me the same
 before and after
 
 Calibrating delay loop... 792.98 BogoMIPS (lpj=3964928)

OK, fair enough.

 I don't have access to a machine capable of running slower than 19.2
 MHz. Maybe machines running in the KHz range would experience differences?

Don't worry, I doubt any ARM processor capable of running Linux ever was 
that slow.


Nicolas
--
To unsubscribe from this list: send the line unsubscribe linux-arm-msm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] [ARM] Translate delay.S into (mostly) C

2010-10-06 Thread Daniel Walker
On Tue, 2010-10-05 at 20:36 -0700, Stephen Boyd wrote:
 On 10/05/2010 10:22 AM, Daniel Walker wrote:
  You shouldn't really reference this series in this comment. You have to
  look at this as a changeset comment. So you really want to describe what
  this change is doing not what the over all series is doing.
 
  It would be better to say something like,
 
  We're implementing this in C to give us further flexibility in allowing
  overrides.
 
  But don't references next patch or this series , but you can do that
  in the intro email.
 
 
 Why not? This is a common thing to do when multiple patches are related
 to one topic. Doing a git log and searching for next patch shows
 others doing the same.

I would not say it's common .. I don't recall seeing many other series
which do this. You shouldn't do it because the patch description doesn't
stand on it own. What if this patch is accepted in isolation, and the
others rejected would your description make sense? Also people don't
always look at a series in git history, sometimes you only look at one
commit so saying next patch or this series can be confusing.

You can have an intro email where you can describe the whole series,
including what your intending for each patch to do.

 
   $ scripts/bloat-o-meter vmlinux.orig vmlinux.new
   add/remove: 0/0 grow/shrink: 2/0 up/down: 12/0 (12)
   function old new   delta
   __udelay  48  56  +8
   __const_udelay40  44  +4
 
  I think the size command might be better for this type of stuff. It
  should give you the same output (or similar).. It's just used more
  frequently. 
 
 Ok.
 
 $ size vmlinux.orig
textdata bss dec hex filename
 6533503  530232 1228296 8292031  7e86bf vmlinux.orig
 $ size vmlinux.new
textdata bss dec hex filename
 6533607  530232 1228296 8292135  7e8727 vmlinux.new
 
 I should mention GCC decided to inline __delay() into __udelay() and
 __const_udelay() and also decided to inline __const_udelay() into
 __delay() meaning we lost the function interleaving the assembly file
 had. I'll make a note of that and sorry for being misleading.
 
  Is this an -Os kernel, or -O2 ?
 
 -Os.
 
  +/*
  + * 0 = xloops = 0x7f06
  + * loops_per_jiffy = 0x01ff (max. 3355 bogomips)
 
  As long as your doing a re-write may as well add some real text here,
  and for the others.
 
 Any suggestions on what to add? I hoped the original comments would be
 good enough already considering the code is unchanged.

you might want to say what the purpose of the function is ..

  + */
  +void __const_udelay(unsigned long xloops)
  +{
  +  unsigned long lpj;
  +  unsigned long loops;
  +
  +  xloops = 14;  /* max = 0x01ff */
  +  lpj = loops_per_jiffy  10;/* max = 0x0001 */
  +  loops = lpj * xloops;   /* max = 0x7fff */
  +  loops = 6;/* max = 2^32-1 */
  +
  +  if (loops)
  +  __delay(loops);
 
  likely/unlikely ?
 
 likely. Although I'm doubtful on how much it will help considering the
 assembly and the code are equivalent already for this part of the code.

I don't know, shouldn't hurt tho.

Daniel


-- 
Sent by an consultant of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum.


--
To unsubscribe from this list: send the line unsubscribe linux-arm-msm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] [ARM] Translate delay.S into (mostly) C

2010-10-06 Thread Nicolas Pitre
On Tue, 5 Oct 2010, Stephen Boyd wrote:

  On 10/05/2010 10:22 AM, Daniel Walker wrote:
  I think the size command might be better for this type of stuff. It
  should give you the same output (or similar).. It's just used more
  frequently. 
 
 Ok.
 
 $ size vmlinux.orig
textdata bss dec hex filename
 6533503  530232 1228296 8292031  7e86bf vmlinux.orig
 $ size vmlinux.new
textdata bss dec hex filename
 6533607  530232 1228296 8292135  7e8727 vmlinux.new

If you modified only one source file then I'd suggest you run 'size' 
only on the affected .o file instead.


Nicolas
--
To unsubscribe from this list: send the line unsubscribe linux-arm-msm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] [ARM] Translate delay.S into (mostly) C

2010-10-06 Thread Stephen Boyd
 On 10/06/2010 07:26 AM, Nicolas Pitre wrote:
 Ok.

 $ size vmlinux.orig
textdata bss dec hex filename
 6533503  530232 1228296 8292031  7e86bf vmlinux.orig
 $ size vmlinux.new
textdata bss dec hex filename
 6533607  530232 1228296 8292135  7e8727 vmlinux.new

 If you modified only one source file then I'd suggest you run 'size' 
 only on the affected .o file instead.

Ok.

$ size delay.o.orig
   textdata bss dec hex filename
 56   0   0  56  38 delay.o.orig
$ size delay.o.new
   textdata bss dec hex filename
192   0   0 192  c0 delay.o.new

Perhaps I should mark __delay and __const_udelay as noinline?

$ size delay.o.noinline
   textdata bss dec hex filename
144   0   0 144  90 delay.o.noinline

Now we get backwards branching.

$ objdump -d delay.o.noinline

Disassembly of section .text:

 __delay:
   0:   e251subsr0, r0, #1  ; 0x1
   4:   8afdbhi 0 __delay
   8:   e12fff1ebx  lr

000c __const_udelay:
   c:   e59f3018ldr r3, [pc, #24]   ; 2c __const_udelay+0x20
  10:   e1a00720lsr r0, r0, #14
  14:   e5933000ldr r3, [r3]
  18:   e1a03523lsr r3, r3, #10
  1c:   e093mul r0, r3, r0
  20:   e1b00320lsrsr0, r0, #6
  24:   012fff1ebxeqlr
  28:   eafeb   0 __delay
  2c:   .word   0x

0030 __udelay:
  30:   e59f3004ldr r3, [pc, #4]; 3c __udelay+0xc
  34:   e093mul r0, r3, r0
  38:   eafeb   c __const_udelay
  3c:   0001a36e.word   0x0001a36e

Is there some way to force GCC to do what I want (interleave the
functions)? It seems happy to inline them and then optimize the register
usage and instruction ordering. Perhaps that is OK though and we're
wasting our time trying to be conservative in code size.

$ objdump -d delay.o.new

Disassembly of section .text:

 __delay:
   0:   e251subsr0, r0, #1  ; 0x1
   4:   8afdbhi 0 __delay
   8:   e12fff1ebx  lr

000c __const_udelay:
   c:   e59f3020ldr r3, [pc, #32]   ; 34 __const_udelay+0x28
  10:   e1a00720lsr r0, r0, #14
  14:   e5933000ldr r3, [r3]
  18:   e1a03523lsr r3, r3, #10
  1c:   e093mul r0, r3, r0
  20:   e1b00320lsrsr0, r0, #6
  24:   012fff1ebxeqlr
  28:   e251subsr0, r0, #1  ; 0x1
  2c:   8afdbhi 28 __const_udelay+0x1c
  30:   e12fff1ebx  lr
  34:   .word   0x

0038 __udelay:
  38:   e59f3028ldr r3, [pc, #40]   ; 68 __udelay+0x30
  3c:   e59f2028ldr r2, [pc, #40]   ; 6c __udelay+0x34
  40:   e0030093mul r3, r3, r0
  44:   e5922000ldr r2, [r2]
  48:   e1a02522lsr r2, r2, #10
  4c:   e1a03723lsr r3, r3, #14
  50:   e0030392mul r3, r2, r3
  54:   e1b03323lsrsr3, r3, #6
  58:   012fff1ebxeqlr
  5c:   e2533001subsr3, r3, #1  ; 0x1
  60:   8afdbhi 5c __udelay+0x24
  64:   e12fff1ebx  lr
  68:   0001a36e.word   0x0001a36e
  6c:   .word   0x

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

--
To unsubscribe from this list: send the line unsubscribe linux-arm-msm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] [ARM] Translate delay.S into (mostly) C

2010-10-06 Thread Daniel Walker
On Wed, 2010-10-06 at 11:30 -0700, Stephen Boyd wrote:
 Is there some way to force GCC to do what I want (interleave the
 functions)? It seems happy to inline them and then optimize the
 register
 usage and instruction ordering. Perhaps that is OK though and we're
 wasting our time trying to be conservative in code size. 

Is it possible to do all this in assembly ? Can't you have the default
implementation using this assembly with different function names, then
just set the assembly function names in C code someplace?

Daniel

-- 
Sent by an consultant of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum.


--
To unsubscribe from this list: send the line unsubscribe linux-arm-msm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] [ARM] Translate delay.S into (mostly) C

2010-10-06 Thread Nicolas Pitre
On Wed, 6 Oct 2010, Daniel Walker wrote:

 On Wed, 2010-10-06 at 11:30 -0700, Stephen Boyd wrote:
  Is there some way to force GCC to do what I want (interleave the
  functions)? It seems happy to inline them and then optimize the
  register
  usage and instruction ordering. Perhaps that is OK though and we're
  wasting our time trying to be conservative in code size. 

You could use the noinline qualifier from linux/compiler.h with those 
functions you don't want inlined.

 Is it possible to do all this in assembly ? Can't you have the default
 implementation using this assembly with different function names, then
 just set the assembly function names in C code someplace?

That weould be my preference too.  Being in assembly means that this 
code is unlikely to change with different optimization levels and/or gcc 
versions which would otherwise require different calibration values. 
Relying on stable calibration is necessary for the lpj kernel cmdline 
parameter to have some meaning.


Nicolas
--
To unsubscribe from this list: send the line unsubscribe linux-arm-msm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] [ARM] Translate delay.S into (mostly) C

2010-10-05 Thread Daniel Walker
On Mon, 2010-09-27 at 20:33 -0700, Stephen Boyd wrote:
 In the next patch we're going to allow machines to override the
 __delay() implementation at runtime so they can implement a timer
 based __delay() routine. It's easier to do this using C, so lets
 write udelay and friends in C.

You shouldn't really reference this series in this comment. You have to
look at this as a changeset comment. So you really want to describe what
this change is doing not what the over all series is doing.

It would be better to say something like,

We're implementing this in C to give us further flexibility in allowing
overrides.

But don't references next patch or this series , but you can do that
in the intro email.

 We lose the #if 0 code, which according to Russell is used to
 make the delay loop more stable and predictable on older CPUs
 (see http://article.gmane.org/gmane.linux.kernel/67 for more
 info). We shouldn't be too worried though, since the next patch
 adds functionality to allow a machine to set the __delay() loop
 themselves, therefore allowing machines to resurrect the
 commented out code if they need it.
 
 bloat-o-meter shows an increase of 12 bytes. Further inspection
 of the assembly shows GCC copying the loops_per_jiffy pointer and
 the magic HZ value to the ends of __const_udelay() and _delay()
 thus contributing an extra 4 and 8 bytes of data to each
 function. These two values weren't taken into account in the
 delay.S version since they weren't part of the function in nm's
 eyes. This means we only really gained an extra 4 bytes due to
 GCC's decision to duplicate the loops_per_jiffy pointer in
 __const_udelay.
 
  $ scripts/bloat-o-meter vmlinux.orig vmlinux.new
  add/remove: 0/0 grow/shrink: 2/0 up/down: 12/0 (12)
  function old new   delta
  __udelay  48  56  +8
  __const_udelay40  44  +4

I think the size command might be better for this type of stuff. It
should give you the same output (or similar).. It's just used more
frequently. 

Is this an -Os kernel, or -O2 ?

 Signed-off-by: Stephen Boyd sb...@codeaurora.org
 Reviewed-by: Saravana Kannan skan...@codeaurora.org
 ---
 
 So maybe we can make the magic HZ value into a #define? UDELAY_SCALAR?
 UDELAY_MAGIC?
 
  arch/arm/include/asm/delay.h |2 +-
  arch/arm/kernel/armksyms.c   |4 --
  arch/arm/lib/delay.S |   65 
 --
  arch/arm/lib/delay.c |   57 
  4 files changed, 58 insertions(+), 70 deletions(-)
  delete mode 100644 arch/arm/lib/delay.S
  create mode 100644 arch/arm/lib/delay.c
 
 diff --git a/arch/arm/include/asm/delay.h b/arch/arm/include/asm/delay.h
 index b2deda1..ccc5ed5 100644
 --- a/arch/arm/include/asm/delay.h
 +++ b/arch/arm/include/asm/delay.h
 @@ -8,7 +8,7 @@
  
  #include asm/param.h   /* HZ */
  
 -extern void __delay(int loops);
 +extern void __delay(unsigned long loops);
  
  /*
   * This function intentionally does not exist; if you see references to
 diff --git a/arch/arm/kernel/armksyms.c b/arch/arm/kernel/armksyms.c
 index 8214bfe..259e549 100644
 --- a/arch/arm/kernel/armksyms.c
 +++ b/arch/arm/kernel/armksyms.c
 @@ -52,10 +52,6 @@ extern void fpundefinstr(void);
  
  EXPORT_SYMBOL(__backtrace);
  
 - /* platform dependent support */
 -EXPORT_SYMBOL(__udelay);
 -EXPORT_SYMBOL(__const_udelay);
 -
   /* networking */
  EXPORT_SYMBOL(csum_partial);
  EXPORT_SYMBOL(csum_partial_copy_from_user);
 diff --git a/arch/arm/lib/delay.S b/arch/arm/lib/delay.S
 deleted file mode 100644
 index 8d6a876..000
 --- a/arch/arm/lib/delay.S
 +++ /dev/null
 @@ -1,65 +0,0 @@
 -/*
 - *  linux/arch/arm/lib/delay.S
 - *
 - *  Copyright (C) 1995, 1996 Russell King
 - *
 - * This program is free software; you can redistribute it and/or modify
 - * it under the terms of the GNU General Public License version 2 as
 - * published by the Free Software Foundation.
 - */
 -#include linux/linkage.h
 -#include asm/assembler.h
 -#include asm/param.h
 - .text
 -
 -.LC0:.word   loops_per_jiffy
 -.LC1:.word   (2199023*HZ)11
 -
 -/*
 - * r0  = 2000
 - * lpj = 0x01ff (max. 3355 bogomips)
 - * HZ  = 1000
 - */
 -
 -ENTRY(__udelay)
 - ldr r2, .LC1
 - mul r0, r2, r0
 -ENTRY(__const_udelay)@ 0 = r0 = 0x7f06
 - ldr r2, .LC0
 - ldr r2, [r2]@ max = 0x01ff
 - mov r0, r0, lsr #14 @ max = 0x0001
 - mov r2, r2, lsr #10 @ max = 0x7fff
 - mul r0, r2, r0  @ max = 2^32-1
 - movsr0, r0, lsr #6
 - moveq   pc, lr
 -
 -/*
 - * loops = r0 * HZ * loops_per_jiffy / 100
 - *
 - * Oh, if only we had a cycle counter...
 - */
 -
 -@ Delay routine
 -ENTRY(__delay)
 -