Re: [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset()
Am Montag, 27. März 2017, 23:16:45 CEST schrieb Alexander Graf: > > On 27/03/2017 17:17, Heiko Stuebner wrote: > > Am Montag, 27. März 2017, 09:14:47 CEST schrieb Alexander Graf: > >> > >> On 27/03/2017 01:38, Simon Glass wrote: > >>> Most of the time the optimised memset() is what we want. For extreme > >>> situations such as TPL it may be too large. For example on the 'rock' > >>> board, using a simple loop saves a useful 48 bytes. With gcc 4.9 and > >>> the rodata bug, this patch is enough to reduce the TPL image below the > >>> limit. > >>> > >>> Signed-off-by: Simon Glass> >>> --- > >>> > >>> lib/Kconfig | 9 + > >>> lib/string.c | 6 -- > >>> 2 files changed, 13 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/lib/Kconfig b/lib/Kconfig > >>> index 65c01573e1..5bf512d8c0 100644 > >>> --- a/lib/Kconfig > >>> +++ b/lib/Kconfig > >>> @@ -52,6 +52,15 @@ config LIB_RAND > >>> help > >>> This library provides pseudo-random number generator functions. > >>> > >>> +config FAST_MEMSET > >>> + bool "Use an optimised memset()" > >>> + default y > >>> + help > >>> + The faster memset() is the arch-specific one (if available) enabled > >>> + by CONFIG_USE_ARCH_MEMSET. If that is not enabled, we can still get > >>> + better performance by write a word at a time. Disable this option > >>> + to reduce code size slightly at the cost of some speed. > >> > >> The comment sounds slightly confused - it took me a few times of reading > >> it until I grasped what it was trying to tell me :). > >> > >>> + > >>> source lib/dhry/Kconfig > >>> > >>> source lib/rsa/Kconfig > >>> diff --git a/lib/string.c b/lib/string.c > >>> index 67d5f6a421..159493ed17 100644 > >>> --- a/lib/string.c > >>> +++ b/lib/string.c > >>> @@ -437,8 +437,10 @@ char *strswab(const char *s) > >>> void * memset(void * s,int c,size_t count) > >>> { > >>> unsigned long *sl = (unsigned long *) s; > >>> - unsigned long cl = 0; > >>> char *s8; > >>> + > >>> +#ifdef CONFIG_FAST_MEMSET > >>> + unsigned long cl = 0; > >>> int i; > >>> > >>> /* do it one word at a time (32 bits or 64 bits) while possible */ > >>> @@ -452,7 +454,7 @@ void * memset(void * s,int c,size_t count) > >>> count -= sizeof(*sl); > >>> } > >>> } > >>> - /* fill 8 bits at a time */ > >>> +#endif /* fill 8 bits at a time */ > >> > >> So while this is all neat, a few ideas: > >> > >> 1) Would having memset in a header improve things even more? After all, > >> each external function call clobbers registers that you need to > >> save/restore... > > > > I'd guess it really depends on the size constraints. The regular > > libgeneric memset compiles on my rk3188 tpl to a total of > > 64bytes on both gcc-4.9 and gcc-6.3 while Simon's fast-memset > > comes down to 14bytes on my rk3188. > > > > On the rk3188 the only memset user is board_init_f, so here memset > > is called only once without needing to save registers and I'd guess if an > > implementation really is that size-constrained to worry about 50bytes > > this one caller will probably always be the only one? > > I'm not sure I follow. If you put it into a header, the compiler has a > better chance of evicting untaken code paths and optimize register usage > over object linked variants (unless you use GOLD). I was mostly > wondering whether that would already give you the savings without > introducing a complicated #ifdef that is going to bitrot over time :). On rk3188-tpl that small non-fast memset gets compiled to (bfd linker): 100809aa : 100809aa: b510push{r4, lr} 100809ac: 22c0movsr2, #192; 0xc0 100809ae: 2100movsr1, #0 100809b0: 4604mov r4, r0 100809b2: f000 f804 bl 100809be 100809b6: 34c0addsr4, #192; 0xc0 100809b8: f8c9 4090 str.w r4, [r9, #144] ; 0x90 100809bc: bd10pop {r4, pc} 100809be : 100809be: 4402add r2, r0 100809c0: 4603mov r3, r0 100809c2: 4293cmp r3, r2 100809c4: d100bne.n 100809c8 100809c6: 4770bx lr 100809c8: f803 1b01 strb.w r1, [r3], #1 100809cc: e7f9b.n 100809c2 not saving any outside registers, as it's used only once at all and what I was trying to say was that in cases where we worry about having the tiniest memset possible, I guess that will most likely stay the only call. But I may have been dug into the rk3188 tpl-specifics to long, to see other possible cases right now :-) . > I'm just slightly worried about the massive number of preprocessor > excludes that happen in U-Boot in general. It seems like something > that's really hard to ever have full testing coverage on. That's essentially what I was worried about as well, seeing that memset can be provided by
Re: [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset()
On 27/03/2017 17:17, Heiko Stuebner wrote: Am Montag, 27. März 2017, 09:14:47 CEST schrieb Alexander Graf: On 27/03/2017 01:38, Simon Glass wrote: Most of the time the optimised memset() is what we want. For extreme situations such as TPL it may be too large. For example on the 'rock' board, using a simple loop saves a useful 48 bytes. With gcc 4.9 and the rodata bug, this patch is enough to reduce the TPL image below the limit. Signed-off-by: Simon Glass--- lib/Kconfig | 9 + lib/string.c | 6 -- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/lib/Kconfig b/lib/Kconfig index 65c01573e1..5bf512d8c0 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -52,6 +52,15 @@ config LIB_RAND help This library provides pseudo-random number generator functions. +config FAST_MEMSET + bool "Use an optimised memset()" + default y + help + The faster memset() is the arch-specific one (if available) enabled + by CONFIG_USE_ARCH_MEMSET. If that is not enabled, we can still get + better performance by write a word at a time. Disable this option + to reduce code size slightly at the cost of some speed. The comment sounds slightly confused - it took me a few times of reading it until I grasped what it was trying to tell me :). + source lib/dhry/Kconfig source lib/rsa/Kconfig diff --git a/lib/string.c b/lib/string.c index 67d5f6a421..159493ed17 100644 --- a/lib/string.c +++ b/lib/string.c @@ -437,8 +437,10 @@ char *strswab(const char *s) void * memset(void * s,int c,size_t count) { unsigned long *sl = (unsigned long *) s; - unsigned long cl = 0; char *s8; + +#ifdef CONFIG_FAST_MEMSET + unsigned long cl = 0; int i; /* do it one word at a time (32 bits or 64 bits) while possible */ @@ -452,7 +454,7 @@ void * memset(void * s,int c,size_t count) count -= sizeof(*sl); } } - /* fill 8 bits at a time */ +#endif /* fill 8 bits at a time */ So while this is all neat, a few ideas: 1) Would having memset in a header improve things even more? After all, each external function call clobbers registers that you need to save/restore... I'd guess it really depends on the size constraints. The regular libgeneric memset compiles on my rk3188 tpl to a total of 64bytes on both gcc-4.9 and gcc-6.3 while Simon's fast-memset comes down to 14bytes on my rk3188. On the rk3188 the only memset user is board_init_f, so here memset is called only once without needing to save registers and I'd guess if an implementation really is that size-constrained to worry about 50bytes this one caller will probably always be the only one? I'm not sure I follow. If you put it into a header, the compiler has a better chance of evicting untaken code paths and optimize register usage over object linked variants (unless you use GOLD). I was mostly wondering whether that would already give you the savings without introducing a complicated #ifdef that is going to bitrot over time :). I'm just slightly worried about the massive number of preprocessor excludes that happen in U-Boot in general. It seems like something that's really hard to ever have full testing coverage on. 2) How much would GOLD save you? Have you tried? U-Boot is small enough of a code base that global optimizations should be able to give significant size savings. I think the issue that this is trying to solve is to allow more toolchains to be used and thus make rebuilds on changes work on a lot of boards at the same time with random toolchains. gcc-6.3 already produces way smaller results (well within the size constraints the rk3188 has) than for example the gcc-4.9 used by buildman as baseline toolchain. Ah, I see. So 4.9 does not have -lto? There's a good chance my gut feeling that GOLD actually saves anything is wrong - I don't know. Has anyone done the numbers? Then we would have something to actually base gut feeling on. Size is always a serious constraint in U-Boot, especially in SPL environments. If we can include one more tool in our portfolio to optimize size across the board, I'm all for it. This patch just feels slightly short-term - but I'm definitely not nack'ing it :). Alex ___ U-Boot mailing list U-Boot@lists.denx.de https://lists.denx.de/listinfo/u-boot
Re: [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset()
Am Sonntag, 26. März 2017, 17:38:16 CEST schrieb Simon Glass: > Most of the time the optimised memset() is what we want. For extreme > situations such as TPL it may be too large. For example on the 'rock' > board, using a simple loop saves a useful 48 bytes. With gcc 4.9 and > the rodata bug, this patch is enough to reduce the TPL image below the > limit. > > Signed-off-by: Simon GlassThis brings down the rk3188-rock tpl from 1020 to 972 bytes (with a 1020 byte size limit for the tpl) even with gcc-4.9 and down to 748 bytes on gcc-6.3. I was using the original memset in all tests before, so am quite sure it should work without issues, but cannot test it on actual hardware this week. Heiko > --- > > lib/Kconfig | 9 + > lib/string.c | 6 -- > 2 files changed, 13 insertions(+), 2 deletions(-) > > diff --git a/lib/Kconfig b/lib/Kconfig > index 65c01573e1..5bf512d8c0 100644 > --- a/lib/Kconfig > +++ b/lib/Kconfig > @@ -52,6 +52,15 @@ config LIB_RAND > help > This library provides pseudo-random number generator functions. > > +config FAST_MEMSET > + bool "Use an optimised memset()" > + default y > + help > + The faster memset() is the arch-specific one (if available) enabled > + by CONFIG_USE_ARCH_MEMSET. If that is not enabled, we can still get > + better performance by write a word at a time. Disable this option > + to reduce code size slightly at the cost of some speed. > + > source lib/dhry/Kconfig > > source lib/rsa/Kconfig > diff --git a/lib/string.c b/lib/string.c > index 67d5f6a421..159493ed17 100644 > --- a/lib/string.c > +++ b/lib/string.c > @@ -437,8 +437,10 @@ char *strswab(const char *s) > void * memset(void * s,int c,size_t count) > { > unsigned long *sl = (unsigned long *) s; > - unsigned long cl = 0; > char *s8; > + > +#ifdef CONFIG_FAST_MEMSET > + unsigned long cl = 0; > int i; > > /* do it one word at a time (32 bits or 64 bits) while possible */ > @@ -452,7 +454,7 @@ void * memset(void * s,int c,size_t count) > count -= sizeof(*sl); > } > } > - /* fill 8 bits at a time */ > +#endif /* fill 8 bits at a time */ > s8 = (char *)sl; > while (count--) > *s8++ = c; > ___ U-Boot mailing list U-Boot@lists.denx.de https://lists.denx.de/listinfo/u-boot
Re: [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset()
Am Montag, 27. März 2017, 09:14:47 CEST schrieb Alexander Graf: > > On 27/03/2017 01:38, Simon Glass wrote: > > Most of the time the optimised memset() is what we want. For extreme > > situations such as TPL it may be too large. For example on the 'rock' > > board, using a simple loop saves a useful 48 bytes. With gcc 4.9 and > > the rodata bug, this patch is enough to reduce the TPL image below the > > limit. > > > > Signed-off-by: Simon Glass> > --- > > > > lib/Kconfig | 9 + > > lib/string.c | 6 -- > > 2 files changed, 13 insertions(+), 2 deletions(-) > > > > diff --git a/lib/Kconfig b/lib/Kconfig > > index 65c01573e1..5bf512d8c0 100644 > > --- a/lib/Kconfig > > +++ b/lib/Kconfig > > @@ -52,6 +52,15 @@ config LIB_RAND > > help > > This library provides pseudo-random number generator functions. > > > > +config FAST_MEMSET > > + bool "Use an optimised memset()" > > + default y > > + help > > + The faster memset() is the arch-specific one (if available) enabled > > + by CONFIG_USE_ARCH_MEMSET. If that is not enabled, we can still get > > + better performance by write a word at a time. Disable this option > > + to reduce code size slightly at the cost of some speed. > > The comment sounds slightly confused - it took me a few times of reading > it until I grasped what it was trying to tell me :). > > > + > > source lib/dhry/Kconfig > > > > source lib/rsa/Kconfig > > diff --git a/lib/string.c b/lib/string.c > > index 67d5f6a421..159493ed17 100644 > > --- a/lib/string.c > > +++ b/lib/string.c > > @@ -437,8 +437,10 @@ char *strswab(const char *s) > > void * memset(void * s,int c,size_t count) > > { > > unsigned long *sl = (unsigned long *) s; > > - unsigned long cl = 0; > > char *s8; > > + > > +#ifdef CONFIG_FAST_MEMSET > > + unsigned long cl = 0; > > int i; > > > > /* do it one word at a time (32 bits or 64 bits) while possible */ > > @@ -452,7 +454,7 @@ void * memset(void * s,int c,size_t count) > > count -= sizeof(*sl); > > } > > } > > - /* fill 8 bits at a time */ > > +#endif /* fill 8 bits at a time */ > > So while this is all neat, a few ideas: > > 1) Would having memset in a header improve things even more? After all, > each external function call clobbers registers that you need to > save/restore... I'd guess it really depends on the size constraints. The regular libgeneric memset compiles on my rk3188 tpl to a total of 64bytes on both gcc-4.9 and gcc-6.3 while Simon's fast-memset comes down to 14bytes on my rk3188. On the rk3188 the only memset user is board_init_f, so here memset is called only once without needing to save registers and I'd guess if an implementation really is that size-constrained to worry about 50bytes this one caller will probably always be the only one? > 2) How much would GOLD save you? Have you tried? U-Boot is small enough > of a code base that global optimizations should be able to give > significant size savings. I think the issue that this is trying to solve is to allow more toolchains to be used and thus make rebuilds on changes work on a lot of boards at the same time with random toolchains. gcc-6.3 already produces way smaller results (well within the size constraints the rk3188 has) than for example the gcc-4.9 used by buildman as baseline toolchain. ___ U-Boot mailing list U-Boot@lists.denx.de https://lists.denx.de/listinfo/u-boot
Re: [U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset()
On 27/03/2017 01:38, Simon Glass wrote: Most of the time the optimised memset() is what we want. For extreme situations such as TPL it may be too large. For example on the 'rock' board, using a simple loop saves a useful 48 bytes. With gcc 4.9 and the rodata bug, this patch is enough to reduce the TPL image below the limit. Signed-off-by: Simon Glass--- lib/Kconfig | 9 + lib/string.c | 6 -- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/lib/Kconfig b/lib/Kconfig index 65c01573e1..5bf512d8c0 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -52,6 +52,15 @@ config LIB_RAND help This library provides pseudo-random number generator functions. +config FAST_MEMSET + bool "Use an optimised memset()" + default y + help + The faster memset() is the arch-specific one (if available) enabled + by CONFIG_USE_ARCH_MEMSET. If that is not enabled, we can still get + better performance by write a word at a time. Disable this option + to reduce code size slightly at the cost of some speed. The comment sounds slightly confused - it took me a few times of reading it until I grasped what it was trying to tell me :). + source lib/dhry/Kconfig source lib/rsa/Kconfig diff --git a/lib/string.c b/lib/string.c index 67d5f6a421..159493ed17 100644 --- a/lib/string.c +++ b/lib/string.c @@ -437,8 +437,10 @@ char *strswab(const char *s) void * memset(void * s,int c,size_t count) { unsigned long *sl = (unsigned long *) s; - unsigned long cl = 0; char *s8; + +#ifdef CONFIG_FAST_MEMSET + unsigned long cl = 0; int i; /* do it one word at a time (32 bits or 64 bits) while possible */ @@ -452,7 +454,7 @@ void * memset(void * s,int c,size_t count) count -= sizeof(*sl); } } - /* fill 8 bits at a time */ +#endif /* fill 8 bits at a time */ So while this is all neat, a few ideas: 1) Would having memset in a header improve things even more? After all, each external function call clobbers registers that you need to save/restore... 2) How much would GOLD save you? Have you tried? U-Boot is small enough of a code base that global optimizations should be able to give significant size savings. Alex ___ U-Boot mailing list U-Boot@lists.denx.de https://lists.denx.de/listinfo/u-boot
[U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset()
Most of the time the optimised memset() is what we want. For extreme situations such as TPL it may be too large. For example on the 'rock' board, using a simple loop saves a useful 48 bytes. With gcc 4.9 and the rodata bug, this patch is enough to reduce the TPL image below the limit. Signed-off-by: Simon Glass--- lib/Kconfig | 9 + lib/string.c | 6 -- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/lib/Kconfig b/lib/Kconfig index 65c01573e1..5bf512d8c0 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -52,6 +52,15 @@ config LIB_RAND help This library provides pseudo-random number generator functions. +config FAST_MEMSET + bool "Use an optimised memset()" + default y + help + The faster memset() is the arch-specific one (if available) enabled + by CONFIG_USE_ARCH_MEMSET. If that is not enabled, we can still get + better performance by write a word at a time. Disable this option + to reduce code size slightly at the cost of some speed. + source lib/dhry/Kconfig source lib/rsa/Kconfig diff --git a/lib/string.c b/lib/string.c index 67d5f6a421..159493ed17 100644 --- a/lib/string.c +++ b/lib/string.c @@ -437,8 +437,10 @@ char *strswab(const char *s) void * memset(void * s,int c,size_t count) { unsigned long *sl = (unsigned long *) s; - unsigned long cl = 0; char *s8; + +#ifdef CONFIG_FAST_MEMSET + unsigned long cl = 0; int i; /* do it one word at a time (32 bits or 64 bits) while possible */ @@ -452,7 +454,7 @@ void * memset(void * s,int c,size_t count) count -= sizeof(*sl); } } - /* fill 8 bits at a time */ +#endif /* fill 8 bits at a time */ s8 = (char *)sl; while (count--) *s8++ = c; -- 2.12.1.578.ge9c3154ca4-goog ___ U-Boot mailing list U-Boot@lists.denx.de https://lists.denx.de/listinfo/u-boot