Re: [racket-dev] can we write these four lines of C in performant racket?
At Mon, 25 Jul 2011 08:05:46 -0400, Sam Tobin-Hochstadt wrote: > On Mon, Jul 25, 2011 at 7:51 AM, Matthew Flatt wrote: > > > > Here are some timings for 1000 iterations on 2^20-element inputs > > (32-bit mode, Mac Book Pro 2.53 GHz): > > > > C as above, gcc -02 : 1409 > > C with indirections, gcc -O2 : 4041 > > C as above, gcc -O0 : 6425 > > C with indirections, gcc -O0 : 8480 > > Racket fxvector (more direct): 8883 > > Racket : 11248 > > > > I can tweak the JIT in small ways to make a small improvement: > > > > Racket with JIT tweaks : 10670 > > What do the JIT tweaks do to the performance of the fxvector version? The tweak doesn't change `fxvector-set!'. _ For list-related administrative tasks: http://lists.racket-lang.org/listinfo/dev
Re: [racket-dev] can we write these four lines of C in performant racket?
On Mon, Jul 25, 2011 at 7:51 AM, Matthew Flatt wrote: > > Here are some timings for 1000 iterations on 2^20-element inputs > (32-bit mode, Mac Book Pro 2.53 GHz): > > C as above, gcc -02 : 1409 > C with indirections, gcc -O2 : 4041 > C as above, gcc -O0 : 6425 > C with indirections, gcc -O0 : 8480 > Racket fxvector (more direct): 8883 > Racket : 11248 > > I can tweak the JIT in small ways to make a small improvement: > > Racket with JIT tweaks : 10670 What do the JIT tweaks do to the performance of the fxvector version? -- sam th sa...@ccs.neu.edu _ For list-related administrative tasks: http://lists.racket-lang.org/listinfo/dev
Re: [racket-dev] can we write these four lines of C in performant racket?
At Sat, 23 Jul 2011 14:42:15 -0400, John Clements wrote: > This C code adds the content of one buffer to another one, with no checking. > The corresponding racket code runs about 10x slower. Do you folks think that > it > should be possible to do better? (One salient fact: these are > shorts--16-bit-ints--not 32-bit ints.) My sources say no --- at least not a lot better with the current JIT. > void addOn(short *dst, int dstOffset, short *src, int srcOffset, int len) { > int i; > for (i = 0; i dst[i+dstOffset] += src[i+srcOffset]; > }} One problem is that a `s16-vector' in Racket is a struct that wraps an FFI wrapper around the array pointer, instead of just an array pointer. So, a C version of the Racket loop is more like void addOn(short_ptr *dst, int dstOffset, short_ptr *src, int srcOffset, int len) { int i; for (i = 0; ia->a[i+dstOffset] += src->a->a[i+srcOffset]; } } Another problem is that the JIT doesn't use registers well enough, so its code is closer in this case to unoptimized gcc output. Finally, there's some extra overhead for the fixnum encoding. Here are some timings for 1000 iterations on 2^20-element inputs (32-bit mode, Mac Book Pro 2.53 GHz): C as above, gcc -02 : 1409 C with indirections, gcc -O2 : 4041 C as above, gcc -O0 : 6425 C with indirections, gcc -O0 : 8480 Racket fxvector (more direct): 8883 Racket : 11248 I can tweak the JIT in small ways to make a small improvement: Racket with JIT tweaks : 10670 The Racket code I tried is below (ugly to run as fast as I could make it). #lang racket/base (require racket/unsafe/ops ffi/vector) (define SIZE 1048576) (define addOn #f) ; defeat inlining of `addOn' (set! addOn (lambda (dst dstOffset src srcOffset len) (let loop ([i 0]) (unless (eq? i len) (let ([d (unsafe-fx+ dstOffset i)]) (unsafe-s16vector-set! dst d (unsafe-fx+ (unsafe-s16vector-ref dst d) (unsafe-s16vector-ref src (unsafe-fx+ srcOffset i) (loop (unsafe-fx+ i 1)) (let ([a (make-s16vector SIZE)] [b (make-s16vector SIZE)]) (time (for ([i (in-range 1000)]) (addOn a 0 b 0 SIZE _ For list-related administrative tasks: http://lists.racket-lang.org/listinfo/dev
Re: [racket-dev] can we write these four lines of C in performant racket?
Is the C code compiled to vectorised assembler? That could account for a factor of about 4-16 depending. N. On Sat, Jul 23, 2011 at 7:42 PM, John Clements wrote: > This C code adds the content of one buffer to another one, with no checking. > The corresponding racket code runs about 10x slower. Do you folks think that > it should be possible to do better? (One salient fact: these are > shorts--16-bit-ints--not 32-bit ints.) > > John _ For list-related administrative tasks: http://lists.racket-lang.org/listinfo/dev
Re: [racket-dev] can we write these four lines of C in performant racket?
On Jul 23, 2011, at 2:46 PM, Robby Findler wrote: > What is the data you're using to represent the shorts in Racket? #s. John > > Robby > > On Sat, Jul 23, 2011 at 1:42 PM, John Clements > wrote: >> This C code adds the content of one buffer to another one, with no checking. >> The corresponding racket code runs about 10x slower. Do you folks think >> that it should be possible to do better? (One salient fact: these are >> shorts--16-bit-ints--not 32-bit ints.) >> >> John >> >> >> >> void addOn(short *dst, int dstOffset, short *src, int srcOffset, int len) { >> int i; >> for (i = 0; i>dst[i+dstOffset] += src[i+srcOffset]; >> }} >> >> >> _ >> For list-related administrative tasks: >> http://lists.racket-lang.org/listinfo/dev >> smime.p7s Description: S/MIME cryptographic signature _ For list-related administrative tasks: http://lists.racket-lang.org/listinfo/dev
Re: [racket-dev] can we write these four lines of C in performant racket?
What is the data you're using to represent the shorts in Racket? Robby On Sat, Jul 23, 2011 at 1:42 PM, John Clements wrote: > This C code adds the content of one buffer to another one, with no checking. > The corresponding racket code runs about 10x slower. Do you folks think that > it should be possible to do better? (One salient fact: these are > shorts--16-bit-ints--not 32-bit ints.) > > John > > > > void addOn(short *dst, int dstOffset, short *src, int srcOffset, int len) { > int i; > for (i = 0; i dst[i+dstOffset] += src[i+srcOffset]; > }} > > > _ > For list-related administrative tasks: > http://lists.racket-lang.org/listinfo/dev > _ For list-related administrative tasks: http://lists.racket-lang.org/listinfo/dev
[racket-dev] can we write these four lines of C in performant racket?
This C code adds the content of one buffer to another one, with no checking. The corresponding racket code runs about 10x slower. Do you folks think that it should be possible to do better? (One salient fact: these are shorts--16-bit-ints--not 32-bit ints.) John void addOn(short *dst, int dstOffset, short *src, int srcOffset, int len) { int i; for (i = 0; i smime.p7s Description: S/MIME cryptographic signature _ For list-related administrative tasks: http://lists.racket-lang.org/listinfo/dev