Re: Small memcpy optimization
On Thu, 8 Nov 2012, Mark Kettenis wrote:
> > On Tuesday 21 August 2012, Stefan Fritsch wrote:
> > > On x86, the xchg operation between reg and mem has an implicit
> > > lock prefix, i.e. it is a relatively expensive atomic operation.
> > > This is not needed here.
> >
> > OKs, anyone?
>
> What you say makes sense, although it might matter only on MP
> (capable) systems.

True, but MP is the norm nowadays.

> If you really want to make things faster, I suppose you could change
> the code into something like
>
> 	pushl	%esi
> 	pushl	%edi
> 	movl	12(%esp),%edi
> 	movl	16(%esp),%esi

That's true. Like this (suggestions for a better label name are
welcome):

--- locore.s
+++ locore.s
@@ -789,7 +789,7 @@ ENTRY(bcopy)
 	pushl	%edi
 	movl	12(%esp),%esi
 	movl	16(%esp),%edi
-	movl	20(%esp),%ecx
+bcopy2:	movl	20(%esp),%ecx
 	movl	%edi,%eax
 	subl	%esi,%eax
 	cmpl	%ecx,%eax	# overlapping?
@@ -827,13 +827,15 @@ ENTRY(bcopy)
 	ret
 
 /*
- * Emulate memcpy() by swapping the first two arguments and calling bcopy()
+ * Emulate memcpy() by loading the first two arguments in reverse order
+ * and jumping into bcopy()
  */
 ENTRY(memcpy)
-	movl	4(%esp),%ecx
-	xchg	8(%esp),%ecx
-	movl	%ecx,4(%esp)
-	jmp	_C_LABEL(bcopy)
+	pushl	%esi
+	pushl	%edi
+	movl	12(%esp),%edi
+	movl	16(%esp),%esi
+	jmp	bcopy2
Re: Small memcpy optimization
Date: Sat, 10 Nov 2012 18:10:53 +0100 (CET)
From: Stefan Fritsch s...@sfritsch.de

> On Thu, 8 Nov 2012, Mark Kettenis wrote:
> > > On Tuesday 21 August 2012, Stefan Fritsch wrote:
> > > > On x86, the xchg operation between reg and mem has an implicit
> > > > lock prefix, i.e. it is a relatively expensive atomic operation.
> > > > This is not needed here.
> > >
> > > OKs, anyone?
> >
> > What you say makes sense, although it might matter only on MP
> > (capable) systems.
>
> True, but MP is the norm nowadays.
>
> > If you really want to make things faster, I suppose you could change
> > the code into something like
> >
> > 	pushl	%esi
> > 	pushl	%edi
> > 	movl	12(%esp),%edi
> > 	movl	16(%esp),%esi
>
> That's true. Like this (suggestions for a better label name are
> welcome):

What about doocpy?  And I would put the label on a line of its own,
such that it stands out more.

> --- locore.s
> +++ locore.s
> @@ -789,7 +789,7 @@ ENTRY(bcopy)
>  	pushl	%edi
>  	movl	12(%esp),%esi
>  	movl	16(%esp),%edi
> -	movl	20(%esp),%ecx
> +bcopy2:	movl	20(%esp),%ecx
>  	movl	%edi,%eax
>  	subl	%esi,%eax
>  	cmpl	%ecx,%eax	# overlapping?
> @@ -827,13 +827,15 @@ ENTRY(bcopy)
>  	ret
> 
>  /*
> - * Emulate memcpy() by swapping the first two arguments and calling bcopy()
> + * Emulate memcpy() by loading the first two arguments in reverse order
> + * and jumping into bcopy()
>   */
>  ENTRY(memcpy)
> -	movl	4(%esp),%ecx
> -	xchg	8(%esp),%ecx
> -	movl	%ecx,4(%esp)
> -	jmp	_C_LABEL(bcopy)
> +	pushl	%esi
> +	pushl	%edi
> +	movl	12(%esp),%edi
> +	movl	16(%esp),%esi
> +	jmp	bcopy2
Re: Small memcpy optimization
From: Stefan Fritsch s...@sfritsch.de
Date: Thu, 1 Nov 2012 22:43:33 +0100

> On Tuesday 21 August 2012, Stefan Fritsch wrote:
> > On x86, the xchg operation between reg and mem has an implicit lock
> > prefix, i.e. it is a relatively expensive atomic operation. This is
> > not needed here.
>
> OKs, anyone?

What you say makes sense, although it might matter only on MP
(capable) systems.

If you really want to make things faster, I suppose you could change
the code into something like

	pushl	%esi
	pushl	%edi
	movl	12(%esp),%edi
	movl	16(%esp),%esi

and then jump to the 5th instruction of bcopy().

> --- a/sys/arch/i386/i386/locore.s
> +++ b/sys/arch/i386/i386/locore.s
> @@ -802,8 +802,9 @@ ENTRY(bcopy)
>   */
>  ENTRY(memcpy)
>  	movl	4(%esp),%ecx
> -	xchg	8(%esp),%ecx
> -	movl	%ecx,4(%esp)
> +	movl	8(%esp),%eax
> +	movl	%ecx,8(%esp)
> +	movl	%eax,4(%esp)
>  	jmp	_C_LABEL(bcopy)
Re: Small memcpy optimization
On Thu, Nov 01, 2012 at 22:43, Stefan Fritsch wrote:
> On Tuesday 21 August 2012, Stefan Fritsch wrote:
> > On x86, the xchg operation between reg and mem has an implicit lock
> > prefix, i.e. it is a relatively expensive atomic operation. This is
> > not needed here.
>
> OKs, anyone?

What do other implementations do?  Benchmarks?  I'm sure someone
somewhere has spent a lot of effort making the world's fastest memcpy.
Taking that work seems better than home grown fiddling.
Re: Small memcpy optimization
On Tuesday 21 August 2012, Stefan Fritsch wrote:
> On x86, the xchg operation between reg and mem has an implicit lock
> prefix, i.e. it is a relatively expensive atomic operation. This is
> not needed here.

OKs, anyone?

> --- a/sys/arch/i386/i386/locore.s
> +++ b/sys/arch/i386/i386/locore.s
> @@ -802,8 +802,9 @@ ENTRY(bcopy)
>   */
>  ENTRY(memcpy)
>  	movl	4(%esp),%ecx
> -	xchg	8(%esp),%ecx
> -	movl	%ecx,4(%esp)
> +	movl	8(%esp),%eax
> +	movl	%ecx,8(%esp)
> +	movl	%eax,4(%esp)
>  	jmp	_C_LABEL(bcopy)
Small memcpy optimization
On x86, the xchg operation between reg and mem has an implicit lock
prefix, i.e. it is a relatively expensive atomic operation. This is
not needed here.

--- a/sys/arch/i386/i386/locore.s
+++ b/sys/arch/i386/i386/locore.s
@@ -802,8 +802,9 @@ ENTRY(bcopy)
  */
 ENTRY(memcpy)
 	movl	4(%esp),%ecx
-	xchg	8(%esp),%ecx
-	movl	%ecx,4(%esp)
+	movl	8(%esp),%eax
+	movl	%ecx,8(%esp)
+	movl	%eax,4(%esp)
 	jmp	_C_LABEL(bcopy)