Re: Performance of tables slower than built in?
On Saturday, 25 May 2019 at 20:07:19 UTC, NaN wrote: It's differential evolution, you can pretty much optimise anything you can quantify. http://www1.icsi.berkeley.edu/~storn/code.html Thanks, they even have Python example code. Nice.
Re: Performance of tables slower than built in?
On Saturday, 25 May 2019 at 14:58:25 UTC, Ola Fosheim Grøstad wrote: On Saturday, 25 May 2019 at 12:51:20 UTC, NaN wrote: I used an evolutionary optimisation algorithm on the table all at once. So you do a weighted sum of max deviation, and 1st and 2nd order discontinuity at the joins. And minimise that across the table as a whole. It seemed you could massively overweight the discontinuities without really affecting the curve fitting that much. So perfect joins only cost a little extra deviation in the fit of the polynomial. Wow, it is pretty cool that you managed to minimize 2nd order discontinuity like that! Is there a paper describing the optimisation algorithm you used? It's differential evolution, you can pretty much optimise anything you can quantify. http://www1.icsi.berkeley.edu/~storn/code.html
Re: Performance of tables slower than built in?
On Saturday, 25 May 2019 at 12:51:20 UTC, NaN wrote: I used an evolutionary optimisation algorithm on the table all at once. So you do a weighted sum of max deviation, and 1st and 2nd order discontinuity at the joins. And minimise that across the table as a whole. It seemed you could massively overweight the discontinuities without really affecting the curve fitting that much. So perfect joins only cost a little extra deviation in the fit of the polynomial. Wow, it is pretty cool that you managed to minimize 2nd order discontinuity like that! Is there a paper describing the optimisation algorithm you used?
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? [...] I'll just leave this here: https://www.dannyarends.nl/Using%20CTFE%20in%20D%20to%20speed%20up%20Sine%20and%20Cosine?
Re: Performance of tables slower than built in?
On Saturday, 25 May 2019 at 09:52:22 UTC, Ola Fosheim Grøstad wrote: On Saturday, 25 May 2019 at 09:04:31 UTC, NaN wrote: It's a pretty common technique in audio synthesis. Indeed. CSound does this. What I've done in the past is store a table of polynomial segments that were optimised with curve fitting. That's an interesting solution, how do you avoid higher order discontinuities between segments? Crossfading? Or maybe it wasn't audible. I used an evolutionary optimisation algorithm on the table all at once. So you do a weighted sum of max deviation, and 1st and 2nd order discontinuity at the joins. And minimise that across the table as a whole. It seemed you could massively overweight the discontinuities without really affecting the curve fitting that much. So perfect joins only cost a little extra deviation in the fit of the polynomial.
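The weighted-sum objective described above could be sketched roughly like this (a hypothetical D illustration, not NaN's actual code: cubic segments, a sampled max-deviation term, and penalties on 1st/2nd-derivative mismatch at the joins, which a black-box optimiser such as differential evolution would then minimise):

```d
import std.math : abs, sin, PI;

// One cubic per table entry: y(t) = c0 + c1*t + c2*t^2 + c3*t^3, t in [0,1]
double evalSeg(const double[4] c, double t)
{
    return c[0] + t * (c[1] + t * (c[2] + t * c[3]));
}

// Weighted cost over the whole table: max fit deviation plus a penalty on
// 1st/2nd-order discontinuities at the joins (wrapping around the cycle).
double cost(const double[4][] segs, double wJoin)
{
    double maxDev = 0.0, joinPenalty = 0.0;
    foreach (s, seg; segs)
    {
        // sampled max deviation from the target curve (here one sine cycle)
        foreach (k; 0 .. 64)
        {
            const double t = k / 63.0;
            const double x = (s + t) / segs.length;
            const double e = abs(evalSeg(seg, t) - sin(2.0 * PI * x));
            if (e > maxDev) maxDev = e;
        }
        // derivative mismatch between the end of this segment (t = 1)
        // and the start of the next segment (t = 0)
        const double[4] next = segs[(s + 1) % segs.length];
        const double d1 = abs((seg[1] + 2*seg[2] + 3*seg[3]) - next[1]);
        const double d2 = abs((2*seg[2] + 6*seg[3]) - 2*next[2]);
        joinPenalty += d1 + d2;
    }
    return maxDev + wJoin * joinPenalty;
}
```

Overweighting the joins then just means picking a large wJoin relative to the deviation term.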
Re: Performance of tables slower than built in?
On Saturday, 25 May 2019 at 09:04:31 UTC, NaN wrote: It's a bit of extra work to calculate the waveform but it actually works out faster than having huge LUTs since you're typically only producing maybe 100 samples in each interrupt callback. Another hybrid option when filling a buffer might be to fill two buffers with an approximation that is crossfaded between the end-points. A bit tricky, but say you have a Taylor polynomial (or even a recurrence relation) that is "correct" near the beginning of the buffer, and another one that is correct near the end of the buffer. Then you could fill two buffers from each end and cross fade between the buffers. You might get some phase-cancellation errors and phasing-like distortion though.
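For concreteness, the blend step of that hybrid idea might look something like this (a hedged sketch; the linear blend weight and the buffer layout are my assumptions, not something from the post):

```d
// Linearly crossfade bufA (an approximation accurate near the start of the
// block) into bufB (one accurate near the end). Equal-length slices assumed.
void crossfade(const double[] bufA, const double[] bufB, double[] outBuf)
{
    const size_t n = outBuf.length;
    foreach (i; 0 .. n)
    {
        // weight runs from 0 at the first sample to 1 at the last
        const double w = n > 1 ? cast(double) i / (n - 1) : 0.0;
        outBuf[i] = (1.0 - w) * bufA[i] + w * bufB[i];
    }
}
```

An equal-power (sin/cos) fade could be substituted if the linear blend audibly dips in the middle.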
Re: Performance of tables slower than built in?
On Saturday, 25 May 2019 at 09:04:31 UTC, NaN wrote: It's a pretty common technique in audio synthesis. Indeed. CSound does this. What I've done in the past is store a table of polynomial segments that were optimised with curve fitting. That's an interesting solution, how do you avoid higher order discontinuities between segments? Crossfading? Or maybe it wasn't audible.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 17:40:40 UTC, Ola Fosheim Grøstad wrote: On Friday, 24 May 2019 at 17:04:33 UTC, Alex wrote: I'm not sure what the real precision of the built-in functions is but it shouldn't be hard to max out a double using standard methods (even if slow, but irrelevant after the LUT has been created). LUTs are primarily useful when you use sin(x) as a signal or when a crude approximation is good enough. One advantage of a LUT is that you can store a more complex computation than the basic function. Like a filtered square wave. It's a pretty common technique in audio synthesis. What I've done in the past is store a table of polynomial segments that were optimised with curve fitting. It's a bit of extra work to calculate the waveform but it actually works out faster than having huge LUTs since you're typically only producing maybe 100 samples in each interrupt callback, so it gets pretty likely that your LUT is pushed into slower cache memory in between calls to generate the audio.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 17:04:33 UTC, Alex wrote: On Friday, 24 May 2019 at 13:57:30 UTC, Ola Fosheim Grøstad wrote: On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote: If it truly is 27x faster then that is very relevant and knowing why is important. Of course, a lot of that might simply be due to LDC and I wasn't able to determine this. Just one more thing you really ought to consider: It isn't obvious that a LUT using double will be more precise than computing sin using single precision float. So when I use single precision float with ldc and "-O3 -ffast-math" then I get roughly 300ms. So in that case the LUT is only 3 times faster. Perhaps not worth it then. You might as well just use float instead of double. The downside is that -ffast-math makes all your computations more inaccurate. It is also possible that recent processors can do multiple sin/cos as SIMD. Too many options… I know. The thing is, the LUT can have as much precision as one wants. One could even spend several days calculating it then loading it from disk. With linear interpolation of what is effectively a discrete time signal, you get an extra 12dB of signal to noise ratio each time you 2x oversample the input. So basically if you start with a LUT of 4 samples, that should give you about 8dB at baseline, and each time you double the length you get an extra 12dB. Or in simpler terms: double the table length, get 2 bits less error. So a table size of 8192 should give you approximately 24 bits clear of errors. But be aware that the error is relative to the maximum magnitude of the signal in the table.
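That rule of thumb (roughly 2 extra bits per doubling) is easy to check empirically. A small sketch, assuming a full-cycle sine LUT with linear interpolation (the table layout and function names here are illustrative, not the code from the thread):

```d
import std.math : sin, abs, PI;
import std.stdio : writefln;

// Max absolute error of a linearly interpolated sine LUT with n steps
// per cycle (n + 1 stored samples so interpolation never wraps).
double maxLutError(size_t n)
{
    auto tab = new double[](n + 1);
    foreach (i; 0 .. n + 1)
        tab[i] = sin(2.0 * PI * i / n);
    double maxErr = 0.0;
    foreach (k; 0 .. 100_000)
    {
        const double x = k / 100_000.0;          // phase in cycles
        const double pos = x * n;
        const size_t i = cast(size_t) pos;
        const double frac = pos - i;
        const double approx = (1.0 - frac) * tab[i] + frac * tab[i + 1];
        const double e = abs(approx - sin(2.0 * PI * x));
        if (e > maxErr) maxErr = e;
    }
    return maxErr;
}

void main()
{
    // the error should drop by roughly 4x (12 dB / 2 bits) per doubling
    foreach (bits; 2 .. 14)
        writefln("n = %5d  maxErr = %e", 1UL << bits, maxLutError(1UL << bits));
}
```

The measured maxima track the analytic bound (2π/n)²/8, which for n = 8192 lands around 7e-8, i.e. roughly 24 bits, as claimed above.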
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 17:04:33 UTC, Alex wrote: I'm not sure what the real precision of the built-in functions is but it shouldn't be hard to max out a double using standard methods (even if slow, but irrelevant after the LUT has been created). LUTs are primarily useful when you use sin(x) as a signal or when a crude approximation is good enough. One advantage of a LUT is that you can store a more complex computation than the basic function. Like a filtered square wave.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 13:57:30 UTC, Ola Fosheim Grøstad wrote: On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote: If it truly is 27x faster then that is very relevant and knowing why is important. Of course, a lot of that might simply be due to LDC and I wasn't able to determine this. Just one more thing you really ought to consider: It isn't obvious that a LUT using double will be more precise than computing sin using single precision float. So when I use single precision float with ldc and "-O3 -ffast-math" then I get roughly 300ms. So in that case the LUT is only 3 times faster. Perhaps not worth it then. You might as well just use float instead of double. The downside is that -ffast-math makes all your computations more inaccurate. It is also possible that recent processors can do multiple sin/cos as SIMD. Too many options… I know. The thing is, the LUT can have as much precision as one wants. One could even spend several days calculating it then loading it from disk. I'm not sure what the real precision of the built-in functions is but it shouldn't be hard to max out a double using standard methods (even if slow, but irrelevant after the LUT has been created). Right now I'm just using the built-ins... Maybe later I'll get back around to all this and make some progress.
Re: Performance of tables slower than built in?
With linear interpolation you get roughly the same result with floats, a little more efficient too (half the memory and a bit faster):

__gshared float[512+1] QuarterSinTab;

void init(){
    const auto res = QuarterSinTab.length-1;
    for(int i = 0; i < res; i++)
        QuarterSinTab[i] = sin(PI*(0.5*i/cast(double)res));
    QuarterSinTab[$-1] = sin(PI*0.5);
}

auto fastQuarterLookup(float x){
    const uint mantissa = cast(uint)((x - floor(x)) * cast(float)(1U<<24));
    const float sign = cast(float)(1 - cast(int)((mantissa>>22)&2));
    const uint phase = (mantissa^((1U<<23)-((mantissa>>22)&1)))&((1U<<23)-1);
    const uint quarterphase = (phase>>13)&511;
    const float frac = cast(float)(phase&((1U<<13)-1))*cast(float)(1.0f/(1U<<13));
    return sign*((1.0f-frac)*QuarterSinTab[quarterphase] + frac*QuarterSinTab[quarterphase+1]);
}
Re: Performance of tables slower than built in?
This appears to get roughly the same results as sin(x):

__gshared double[512+1] QuarterSinTab;

void init(){
    const auto res = QuarterSinTab.length-1;
    for(int i = 0; i < res; i++)
        QuarterSinTab[i] = sin(PI*(0.5*i/cast(double)res));
    QuarterSinTab[$-1] = sin(PI*0.5);
}

auto fastQuarterLookup(double x){
    const ulong mantissa = cast(ulong)((x - floor(x)) * cast(double)(1UL<<54));
    const double sign = cast(double)(1 - cast(int)((mantissa>>52)&2));
    const ulong phase = (mantissa^((1UL<<53)-((mantissa>>52)&1)))&((1UL<<53)-1);
    const uint quarterphase = (phase>>43)&511;
    const double frac = cast(double)(phase&((1UL<<43)-1))*cast(double)(1.0/(1UL<<43));
    return sign*((1.0-frac)*QuarterSinTab[quarterphase] + frac*QuarterSinTab[quarterphase+1]);
}

Ola.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 11:45:46 UTC, Ola Fosheim Grøstad wrote:

const double sign = cast(double)(-cast(uint)((mantissa>>63)&1));

Yep, this was wrong (0 or -1). Should be something like (1 or -1):

const double sign = cast(double)(1 - cast(int)((mantissa>>62)&2));

You'll have to code it up more carefully than I did, just following the same principles. (These ±1 errors do not affect the performance.) Also, for comparison, just running the 2 lookups in the loop comes to 32ms. So 3 times that sounds reasonable for extracting the phase, determining the sign, reversing the phase and doing the linear interpolation.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote: If it truly is 27x faster then that is very relevant and knowing why is important. Of course, a lot of that might simply be due to LDC and I wasn't able to determine this. Just one more thing you really ought to consider: It isn't obvious that a LUT using double will be more precise than computing sin using single precision float. So when I use single precision float with ldc and "-O3 -ffast-math" then I get roughly 300ms. So in that case the LUT is only 3 times faster. Perhaps not worth it then. You might as well just use float instead of double. The downside is that -ffast-math makes all your computations more inaccurate. It is also possible that recent processors can do multiple sin/cos as SIMD. Too many options… I know.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 12:56:55 UTC, Ola Fosheim Grøstad wrote: [...] suggest above. Just beware that I haven't checked the code, so it might be off by ±1 and such. So before using such code for anything important, make sure that it is tested for the edge cases, like denormal numbers (values close to zero), and roundoff errors where computations "accidentally" cross over 1.0 and stuff like that.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote: So it seems like a quarter-wave LUT is 27 times faster than sin… If so then that is great and what I'd expected to achieve originally. I guess this is using LDC though? I wasn't able to compile with LDC since after updating I'm getting linker errors that I have to go figure out. Yes, the gist linked above is just your code with minor changes, that was 4-5 times faster. To get to 27 times faster you need to use the integer bit-manipulation scheme that I suggest above. Just beware that I haven't checked the code, so it might be off by ±1 and such. Anyway, it is more fun for you to code up your own version than to try to figure out mine. Just follow the principles and you should get close to that performance, I think. (I'll refine the code later, but don't really have time now.) You just have to make sure that the generated instructions fill the entire CPU pipeline. What exactly does this mean? I realize the pipeline in CPUs is how the CPU decodes and optimizes the instructions, but when you say "You have to make sure" that presupposes there is a method or algorithm to know. Yes, you have to look up information about the CPU in your computer. Each core has a set of "lanes" that are computed simultaneously. Some instructions can go into many lanes, but not all. Then there might be bubbles in the pipeline (the lane) that can be filled up with integer/bit manipulation instructions. It is tedious to look that stuff up. So, last resort: just try to mix simple integer with simple double computations (avoid division). Are you saying that I did not have enough instructions that the pipeline could take advantage of? Yes, you most likely got bubbles. Empty space where the core has nothing to send down a lane, because it is waiting for some computation to finish so that it can figure out what to do next.
Basic optimization: Step 1: reduce dependencies between computations. Step 2: make sure you generate a mix of simple integer/double instructions that can fill up all the computation lanes at the same time. Step 3: make sure loops only contain a few instructions; the CPU can unroll loops in hardware if they are short (not valid here though). Of course, a lot of that might simply be due to LDC and I wasn't able to determine this. I think I got better performance because I filled more lanes in the pipeline. Half sin was done above but quarter sine can be used (there are 4 quadrants but only one has to be tabulated because all the others differ by sign and reversion (1 - x), it's a matter of figuring out the sign). Yes, as I mentioned, the first bit of the phase is the sign and the second bit of the phase is the reversion of the indexing. Of course it requires extra computation so it would be interesting to see the difference in performance for the extra logic. It adds perhaps 2-5 cycles or so, is my guess. exp(x) can be written as exp(floor(x) + {x}) = exp(floor(x))*exp({x}) [...] With linear interpolation one can get very accurate (for all practical purposes) LUT methods that, if your code is right, are at least an order of magnitude faster. The memory requirements will be quite small with linear interpolation. I think you need to do something with the x before you look up, so that you have some kind of fast nonlinear mapping to the indexes. But for exp() you might prefer an approximation instead, perhaps a polynomial Taylor series. Searching the web should give some ideas. It seems you already have the half-sin done. I did the quarter sin though, not the half-sin (but that is almost the same, just drop the reversion of the indexing). (Let's talk about this later, since we both have other things on our plate. Fun topic! :-)
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 11:45:46 UTC, Ola Fosheim Grøstad wrote: On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote: On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote: Either way, sin is still twice as fast. Also, in the code the sinTab version is missing the writeln so it would have been faster... so it is not being optimized out. Well, when I run this modified version: https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c on https://run.dlang.io/ then I get: LUT: 709 sin(x): 2761 So the LUT is 3-4 times faster even with your quarter-LUT overhead. FWIW, as far as I can tell I managed to get the lookup version down to 104 by using bit manipulation tricks like these:

auto fastQuarterLookup(double x){
    const ulong mantissa = cast(ulong)((x - floor(x)) * (cast(double)(1UL<<63)*2.0));
    const double sign = cast(double)(-cast(uint)((mantissa>>63)&1));
    … etc

So it seems like a quarter-wave LUT is 27 times faster than sin… If so then that is great and what I'd expected to achieve originally. I guess this is using LDC though? I wasn't able to compile with LDC since after updating I'm getting linker errors that I have to go figure out. You just have to make sure that the generated instructions fill the entire CPU pipeline. What exactly does this mean? I realize the pipeline in CPUs is how the CPU decodes and optimizes the instructions, but when you say "You have to make sure" that presupposes there is a method or algorithm to know. Are you saying that I did not have enough instructions that the pipeline could take advantage of? In any case, I'll check your code out and try to figure out the details and see exactly what is going on. If it truly is 27x faster then that is very relevant and knowing why is important. Of course, a lot of that might simply be due to LDC and I wasn't able to determine this. Can you do some stats for dmd and ldc? You seem to be interested in this, are you up for a challenge?
The idea is to use tables to optimize these functions. Half sin was done above but quarter sine can be used (there are 4 quadrants but only one has to be tabulated because all the others differ by sign and reversion (1 - x), it's a matter of figuring out the sign). Of course it requires extra computation so it would be interesting to see the difference in performance for the extra logic. Then there is exp. exp(x) can be written as exp(floor(x) + {x}) = exp(floor(x))*exp({x}) and so one can optimize this by tabulating exp(x) for 0 <= x < 1, which handles the fractional part of x, then tabulating it for a wide range of integers (probably in powers of 2). E.g., exp(3.5) = exp(3)*exp(.5), both of which come from a lookup table. Or one could do exp(3) = exp(1 + 1 + 1) = exp(1)*exp(1)*exp(1) (this requires iteration if we do not tabulate exp(3)). Hence one would limit the iteration count by tabulating things like exp(10^k) and exp(k) for -10 < k < 10. The idea is basically that one can get really dense LUTs for a small range that are then used to build the function for arbitrary inputs. With linear interpolation one can get very accurate (for all practical purposes) LUT methods that, if your code is right, are at least an order of magnitude faster. The memory requirements will be quite small with linear interpolation (and ideally quadratic or cubic if the costs are not too high). That was what I was starting to work on before I got thrown off by it being much slower. It seems you already have the half-sin done. One could do things like sin, cos (obviously easy), exp, exp()^2, erf(x), sinh, cosh, etc. Things like sec could also be done as it would save a division since it seems they take about 30 cycles. But it would depend on the memory used. [I can't mess with this now because I've gotta work on other things at the moment] Thanks.
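A minimal sketch of the exp(x) = exp(floor(x))*exp({x}) scheme described above, assuming linear interpolation for the fractional table and a fixed integer range of -10 to 10 (table sizes and function names are my own choices, not from the post):

```d
import std.math : exp, floor;

enum FRAC_SIZE = 1024;
__gshared double[21] intExpTab;               // exp(k) for k = -10 .. 10
__gshared double[FRAC_SIZE + 1] fracExpTab;   // exp(f) for f in [0, 1]

void initExpTabs()
{
    foreach (k; -10 .. 11)
        intExpTab[k + 10] = exp(cast(double) k);
    foreach (i; 0 .. FRAC_SIZE + 1)
        fracExpTab[i] = exp(i / cast(double) FRAC_SIZE);
}

// exp(x) as table(integer part) * interpolated table(fractional part);
// valid for -10.0 <= x < 10.0
double lutExp(double x)
{
    const double fl = floor(x);
    const double t = (x - fl) * FRAC_SIZE;
    const size_t i = cast(size_t) t;
    const double frac = t - i;
    const double efrac = (1.0 - frac) * fracExpTab[i] + frac * fracExpTab[i + 1];
    return intExpTab[cast(int) fl + 10] * efrac;
}
```

With 1024 fractional steps the interpolation error of the exp(f) factor is around 3e-7, so the relative error of the product stays in the same ballpark across the whole range.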
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 12:01:55 UTC, Alex wrote: Well, the QuarterWave was supposed to generate just a quarter since that is all that is required for these functions due to symmetry and periodicity. I started with a half to get that working, then figure out the sign flipping. Sure, it is a tradeoff. You pollute the cache less this way, but you have to figure out the sign and the lookup-direction. The trick is then to turn the phase into an unsigned integer, then you get: 1. the highest bit will tell you that you need to use the inverse sign for the result. 2. the next highest bit will tell you that you need to look up in the reverse direction. What is key to performance here is that x86 can do many simple integer/bit operations in parallel, but only a few floating point operations. Also avoid all conditionals. Use bitmasking instead, something along the lines of:

const ulong phase = mantissa^((1UL<<63)-((mantissa>>62)&1));
const uint quarterphase = (phase>>53)&511;

(Haven't checked the correctness of this, but this shows the general principle.) Ola.
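To make the two rules above concrete, here is a toy 16-bit version (my own illustration, not code from the thread: bit 15 as the sign, bit 14 as the direction, a 14-bit in-quarter index):

```d
// Branchless decode of a 16-bit phase accumulator:
// bit 15 -> sign of the result, bit 14 -> reversed table direction.
int signOf(ushort phase)
{
    return 1 - cast(int)((phase >> 14) & 2);   // +1 first half, -1 second half
}

ushort quarterIndex(ushort phase)
{
    // If bit 14 is set, XOR with 0x3FFF reverses the in-quarter index;
    // otherwise the mask is 0 and the index passes through untouched.
    const int mask = 0x3FFF * ((phase >> 14) & 1);
    return cast(ushort)((phase ^ mask) & 0x3FFF);
}
```

Both helpers compile to a handful of ALU instructions with no branches, which is exactly what lets them overlap with the floating-point interpolation.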
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 11:57:44 UTC, Alex wrote: On Friday, 24 May 2019 at 08:13:00 UTC, Basilez B. wrote: On Thursday, 23 May 2019 at 10:16:42 UTC, Alex wrote: On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? [...] Hi, lookup tables ARE faster but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON AT RUNTIME. Just replace it with a static if (Method = "Linear") { /*...*/} else { /*...*/} Also takes care to the type used. With DMD the implicit coercion of float and double can lead to extra conversions. You'll directly see a 15% gain after refactoring the switch. Surely not?!?! Surely the compiler can optimize that switch since the value passed is CT? I thought the whole point of not having static switch(analogous to static if) was because it would go ahead and optimize these cases for us... and it's just a switch, just a jmp table. Try by yourself but to be clear note that I don't like your attitude, which I find disrespectful. Are you an idiot or on medications? My attitude? What did you want me to do? Suck your cock? My attitude? Seriously? WHERE? WHERE? You quoted everything? I don't like your attitude? You seem to be extremely oversensitive and take things out of context? It sounds like you just don't like people questioning you in any way shape or form even if you are wrong. "Surely not?!?! Surely the compiler can optimize that switch since the value passed is CT? I thought the whole point of not having static switch(analogous to static if) was because it would go ahead and optimize these cases for us... and it's just a switch, just a jmp table." Where? I seriously do not like your attitude though! You attack me when I said nothing offensive to you. 
You have serious problems. Get back on or get off the lithium. [Now, of course, you get to see what an attitude really looks like... maybe I've taught you a bit about perspective? I doubt it.] What amazes me is that you Basilez come in and start stuff and yet my response to you starting stuff will get deleted and yours won't... that is typical. You really have mental issues. No where in my post was I hostile to you and yet you decided to have a hissy fit. You are a snowflake that obviously need a ass kicking. That will teach you not to be so sensitive and interpret things in your own little cushioned wall universe. See, no one was attacking you or questioning your intelligence... that is something you perceived and made up all on your own. But you will be supported by all the other people who love cushy walls and koolaid rather than getting the help you need.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 11:45:46 UTC, Ola Fosheim Grøstad wrote: On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote: On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote: Either way, sin is still twice as fast. Also, in the code the sinTab version is missing the writeln so it would have been faster... so it is not being optimized out. Well, when I run this modified version: https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c on https://run.dlang.io/ then I get: LUT: 709 sin(x): 2761 So the LUT is 3-4 times faster even with your quarter-LUT overhead. FWIW, as far as I can tell I managed to get the lookup version down to 104 by using bit manipulation tricks like these:

auto fastQuarterLookup(double x){
    const ulong mantissa = cast(ulong)((x - floor(x)) * (cast(double)(1UL<<63)*2.0));
    const double sign = cast(double)(-cast(uint)((mantissa>>63)&1));
    … etc

So it seems like a quarter-wave LUT is 27 times faster than sin… You just have to make sure that the generated instructions fill the entire CPU pipeline. Well, the QuarterWave was supposed to generate just a quarter since that is all that is required for these functions due to symmetry and periodicity. I started with a half to get that working, then figure out the sign flipping. Essentially one just has to tabulate a quarter of sin, that is, from 0 to 90 degrees, and then get the sign right. This allows one to have 4 times the resolution or 1/4 the size at the same cost. Or, to put it another way, sin has 4-fold redundancy. I'll check out your code, thanks for looking into it.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 08:13:00 UTC, Basilez B. wrote: On Thursday, 23 May 2019 at 10:16:42 UTC, Alex wrote: On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? [...] Hi, lookup tables ARE faster but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON AT RUNTIME. Just replace it with a static if (Method = "Linear") { /*...*/} else { /*...*/} Also takes care to the type used. With DMD the implicit coercion of float and double can lead to extra conversions. You'll directly see a 15% gain after refactoring the switch. Surely not?!?! Surely the compiler can optimize that switch since the value passed is CT? I thought the whole point of not having static switch(analogous to static if) was because it would go ahead and optimize these cases for us... and it's just a switch, just a jmp table. Try by yourself but to be clear note that I don't like your attitude, which I find disrespectful. Are you an idiot or on medications? My attitude? What did you want me to do? Suck your cock? My attitude? Seriously? WHERE? WHERE? You quoted everything? I don't like your attitude? You seem to be extremely oversensitive and take things out of context? It sounds like you just don't like people questioning you in any way shape or form even if you are wrong. "Surely not?!?! Surely the compiler can optimize that switch since the value passed is CT? I thought the whole point of not having static switch(analogous to static if) was because it would go ahead and optimize these cases for us... and it's just a switch, just a jmp table." Where? I seriously do not like your attitude though! You attack me when I said nothing offensive to you. You have serious problems. 
Get back on or get off the lithium. [Now, of course, you get to see what an attitude really looks like... maybe I've taught you a bit about perspective? I doubt it.]
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote: On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote: Either way, sin is still twice as fast. Also, in the code the sinTab version is missing the writeln so it would have been faster... so it is not being optimized out. Well, when I run this modified version: https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c on https://run.dlang.io/ then I get: LUT: 709 sin(x): 2761 So the LUT is 3-4 times faster even with your quarter-LUT overhead. FWIW, as far as I can tell I managed to get the lookup version down to 104 by using bit manipulation tricks like these:

auto fastQuarterLookup(double x){
    const ulong mantissa = cast(ulong)((x - floor(x)) * (cast(double)(1UL<<63)*2.0));
    const double sign = cast(double)(-cast(uint)((mantissa>>63)&1));
    … etc

So it seems like a quarter-wave LUT is 27 times faster than sin… You just have to make sure that the generated instructions fill the entire CPU pipeline.
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:

for(int i = 0; i < res; i++)
    QuarterSinTab[i] = sin(PI*(i/cast(double)res));

Btw, I think this creates a half sine, not a quarter, so you want (?):

QuarterSinTab[i] = sin(PI*(0.5*i/cast(double)res));

QuarterSinTab[$-1] = QuarterSinTab[0];

This creates a discontinuity if you create a quarter sine, in that case you probably wanted:

QuarterSinTab[$-1] = sin(PI*0.5)

Otherwise you will never get 1 or -1. But none of these affect performance.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote: https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c I made an error here:

return s*((1 - f)*QuarterSinTab[ai&511] + f*QuarterSinTab[(ai+1)&511]);

Should of course be:

return s*((1 - f)*QuarterSinTab[ai&511] + f*QuarterSinTab[(ai&511)+1]);

However that does not impact the performance.
Re: Performance of tables slower than built in?
On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote: Either way, sin is still twice as fast. Also, in the code the sinTab version is missing the writeln so it would have been faster... so it is not being optimized out. Well, when I run this modified version: https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c on https://run.dlang.io/ then I get: LUT: 709 sin(x): 2761 So the LUT is 3-4 times faster even with your quarter-LUT overhead.
Re: Performance of tables slower than built in?
On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote: So the LUT is 3-4 times faster even with your quarter-LUT overhead. 4-5 times faster actually, since I made the LUT size known at compile time.
Re: Performance of tables slower than built in?
On Thursday, 23 May 2019 at 10:16:42 UTC, Alex wrote: On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? [...] Hi, lookup tables ARE faster but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON. Just replace it with a static if (Method == "Linear") { /*...*/ } else { /*...*/ } Also take care with the type used. With DMD the implicit coercion of float and double can lead to extra conversions. You'll directly see a 15% gain after refactoring the switch. Surely not?!?! Surely the compiler can optimize that switch since the value passed is CT? I thought the whole point of not having static switch (analogous to static if) was because it would go ahead and optimize these cases for us... and it's just a switch, just a jmp table. Try it yourself, but to be clear, note that I don't like your attitude, which I find disrespectful. I'm just here to help. I explained the big problem you have in your code and you start discussing something that's not to be discussed AT ALL. Look at this https://d.godbolt.org/z/vtzVdp, in sinTab you'll be able to see call pure nothrow @nogc @safe int object.__switch!(immutable(char), "Linear", "Constant").__switch(scope const(immutable(char)[]))@PLT and this even with LDC2 -O3. That's why your LUT is so slow. Refactor the switch with "static if", the branch will be eliminated and you'll see the perf improvement.
Re: Performance of tables slower than built in?
On Thursday, 23 May 2019 at 21:50:38 UTC, Alex wrote: I've used very small LUTs, like a length of 5, and it didn't significantly change anything. Use a size that is 2^n, then mask the index, and hopefully that will turn off bounds checks. E.g. if the LUT size is 16, then index the LUT with "i & 15". I haven't tested this well but was just thrown off by the results, as it should easily have been inverted; I expected quite a significant speed-up (several factors) and not the reverse. Well, you could take the time, multiply by the clock frequency, and divide by the number of iterations to get the number of cycles per iteration. If it is more than a dozen, then something is wrong.
Re: Performance of tables slower than built in?
On Thursday, 23 May 2019 at 21:30:47 UTC, Alex wrote: I don't see how that can necessarily be faster. A LUT can give full 64-bit precision with one operation. CORDIC needs iteration, at least 10 to be of any use. LUTs are precision independent, assuming the creation cost is not included. It isn't; IEEE sin is typically not fast. ARM CPUs let you run fewer iterations, though, for nonstandard floats. I have some code that uses sin, exp and a few other primitive algebraic functions, and in one case the code is extremely slow (it uses exp). I haven't done much testing of all this but something just seems off somewhere. Don't know about exp, but some operations are slow when you get too close to zero: so-called denormal numbers. I guess there is no tool that can tell one exactly what is happening to a piece of code in the CPU... basically an instruction-level profiler? VTune from Intel; not free, AFAIK.
Re: Performance of tables slower than built in?
On Thursday, 23 May 2019 at 18:57:03 UTC, Ola Fosheim Grøstad wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? Not when I tried it with one of the online compilers; the LUT is 3-4 times faster on your tight inner loop, but I guess it depends on the CPU. A LUT should be very fast in long-running tight inner loops like that, once the cache is warm, with such a small LUT that fits in working memory (cache). I've used very small LUTs, like a length of 5, and it didn't significantly change anything. Unless it is being hit hard at the start and that is skewing the results, it doesn't seem to be the issue. I haven't tested this well but was just thrown off by the results, as it should easily have been inverted; I expected quite a significant speed-up (several factors) and not the reverse.
Re: Performance of tables slower than built in?
On Thursday, 23 May 2019 at 19:17:40 UTC, Ola Fosheim Grøstad wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:

xxx = 0;
sw.reset();
sw.start();
for(double i = 0; i < 1000; i++) xxx += sin(PI*i);
t = sw.peek().msecs;
writeln(t);
sw.stop();
}

What you are doing wrong is that xxx is never used... So DMD removes it altogether? If you add writeln(xxx) after the second loop as well, then maybe the measurements make more sense? Ola. Maybe, I thought I put the writeln(xxx) in there to do that... yeah, looking at my code it's there, so that is not the problem. This is the code I had:

import std.datetime;

double xxx = 0;
StopWatch sw;
sw.start();
for(double i = 0; i < 1000; i++) xxx += sinTab(i);
auto t = sw.peek().msecs;
writeln(t);
sw.stop();
writeln(xxx);

xxx = 0;
sw.reset();
sw.start();
for(double i = 0; i < 1000; i++) xxx += sin(PI*i);
t = sw.peek().msecs;
writeln(t);
sw.stop();
writeln(xxx);

Either I modified it later, deleted one of those lines, or the editor screwed up when I pasted it... I don't recall touching the code afterwards. Either way, sin is still twice as fast. Also, in the code the sinTab version is missing the writeln, so it would have been faster... so it is not being optimized out.
Re: Performance of tables slower than built in?
On Thursday, 23 May 2019 at 15:20:22 UTC, Timon Gehr wrote: On 23.05.19 12:21, Alex wrote: On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access. It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays. Surely not? I'm not sure what method is used to calculate them, and maybe a table method is used internally for the common functions (maybe the periodic ones), but memory access surely is faster than multiplying doubles? ... Depends on what kind of memory access, and what kind of faster. If you hit L1 cache then a memory access might be (barely) faster than a single double multiplication. (But modern hardware usually can do multiple double multiplies in parallel, and presumably also multiple memory reads, using SIMD and/or instruction-level parallelism.) I think a single in-register double multiplication will be roughly 25 times faster than an access to main memory. Each access to main memory will pull an entire cache line from main memory to the cache, so if you have good locality (you usually won't with a LUT), your memory accesses will be faster on average. There are a lot of other microarchitectural details that can matter quite a lot for performance. I realize a lot of optimizations could be made, but I still find it hard to believe that it could be done faster even with special hardware, unless a LUT is used in the hardware already or some highly optimized algorithm.
The whole point of general-purpose computing was to get away from specialized hardware, because it was cheaper and one could spend the resources on improving raw cycle performance, which would benefit the entire system. But I guess because of the complexity of hardware nowadays it's hard to really figure out what is going on. http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/ Maybe when it is an "intrinsic" it can avoid some of the issues that might otherwise cost in a LUT (such as a context switch). And most of the time these functions are computed by some series that requires many terms. I'd expect, say, that to compute sin one would require at least 10 multiplies for any accuracy... and surely that is much slower than simply accessing a table (it's true that my code is more complex due to the modulos, and maybe that is eating up the diff). Do you have any proof of your claims? Like a paper that discusses such things, so I can see what's really going on and how they achieve such performance (and how accurate it is)? Not exactly what you asked, but this might help: https://www.agner.org/optimize Also, look up the CORDIC algorithm. I don't see how that can necessarily be faster. A LUT can give full 64-bit precision with one operation. CORDIC needs iteration, at least 10 to be of any use. LUTs are precision independent, assuming the creation cost is not included. I couldn't find anything in that link that specifically addressed speeding up typical math functions. It would be nice to know exactly what is going on: which functions are optimized, and the typical speed-up over a LUT. I have some code that uses sin, exp and a few other primitive algebraic functions, and in one case the code is extremely slow (it uses exp). I haven't done much testing of all this but something just seems off somewhere.
Either sin is heavily optimized and exp is not (exp too admits a CORDIC-like algorithm, since exp(a+b) = exp(a)*exp(b)), or something else is happening that I'm not aware of... which is why I went to tabularizing these functions in the first place. Also, I don't know the kind of accuracy they have, which doesn't help much, but I could use a more accurate algorithm that is slower, since it is pre-computed. That site I linked does talk about some of the issues and such... I guess I just need to update my thinking on modern CPUs a little. I guess there is no tool that can tell one exactly what is happening to a piece of code in the CPU... basically an instruction-level profiler?
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:

xxx = 0;
sw.reset();
sw.start();
for(double i = 0; i < 1000; i++) xxx += sin(PI*i);
t = sw.peek().msecs;
writeln(t);
sw.stop();
}

What you are doing wrong is that xxx is never used... So DMD removes it altogether? If you add writeln(xxx) after the second loop as well, then maybe the measurements make more sense? Ola.
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? Not when I tried it with one of the online compilers; the LUT is 3-4 times faster on your tight inner loop, but I guess it depends on the CPU. A LUT should be very fast in long-running tight inner loops like that, once the cache is warm, with such a small LUT that fits in working memory (cache).
Re: Performance of tables slower than built in?
On 23.05.19 12:21, Alex wrote: On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access. It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays. Surely not? I'm not sure what method is used to calculate them, and maybe a table method is used internally for the common functions (maybe the periodic ones), but memory access surely is faster than multiplying doubles? ... Depends on what kind of memory access, and what kind of faster. If you hit L1 cache then a memory access might be (barely) faster than a single double multiplication. (But modern hardware usually can do multiple double multiplies in parallel, and presumably also multiple memory reads, using SIMD and/or instruction-level parallelism.) I think a single in-register double multiplication will be roughly 25 times faster than an access to main memory. Each access to main memory will pull an entire cache line from main memory to the cache, so if you have good locality (you usually won't with a LUT), your memory accesses will be faster on average. There are a lot of other microarchitectural details that can matter quite a lot for performance. And most of the time these functions are computed by some series that requires many terms. I'd expect, say, that to compute sin one would require at least 10 multiplies for any accuracy... and surely that is much slower than simply accessing a table (it's true that my code is more complex due to the modulos, and maybe that is eating up the diff). Do you have any proof of your claims?
Like a paper that discusses such things so I can see what's really going on and how they achieve such performance(and how accurate)? Not exactly what you asked, but this might help: https://www.agner.org/optimize Also, look up the CORDIC algorithm.
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access. It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays. Surely not? I'm not sure what method is used to calculate them, and maybe a table method is used internally for the common functions (maybe the periodic ones), but memory access surely is faster than multiplying doubles? And most of the time these functions are computed by some series that requires many terms. I'd expect, say, that to compute sin one would require at least 10 multiplies for any accuracy... and surely that is much slower than simply accessing a table (it's true that my code is more complex due to the modulos, and maybe that is eating up the diff). Do you have any proof of your claims? Like a paper that discusses such things, so I can see what's really going on and how they achieve such performance (and how accurate it is)?
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? [...] Hi, lookup tables ARE faster, but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON. Just replace it with a static if (Method = "Linear") { /*...*/ } else { /*...*/ } Also take care with the type used. With DMD the implicit coercion of float and double can lead to extra conversions. You'll directly see a 15% gain after refactoring the switch. Surely not?!? Surely the compiler can optimize that switch since the value passed is known at compile time? I thought the whole point of not having a static switch (analogous to static if) was that it would go ahead and optimize these cases for us... and it's just a switch, just a jump table.
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? [...] Hi, lookup tables ARE faster, but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON. Just replace it with a static if (Method = "Linear") { /*...*/ } else { /*...*/ } Oh no... I meant "==" obviously. Also take care with the type used. With DMD the implicit coercion of float and double can lead to extra conversions. You'll directly see a 15% gain after refactoring the switch.
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? [...] Hi, lookup tables ARE faster, but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON. Just replace it with a static if (Method = "Linear") { /*...*/ } else { /*...*/ } Also take care with the type used. With DMD the implicit coercion of float and double can lead to extra conversions. You'll directly see a 15% gain after refactoring the switch.
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote: On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access. It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays. Exactly, and by the way this is an old story already. I remember, a long time ago when I was studying/writing some games, I decided to test pre-computed arrays (LUTs) for SIN/COS vs functions, and the latter would beat the arrays pretty easily. And by the way, when porting old games (if not changing the game logic too much), you usually get rid of the LUT and use the functions directly. Matheus.
Re: Performance of tables slower than built in?
On 22/05/2019 12:22 PM, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? The code below uses matplotlibd but is not necessary. The timing code is lower. Ideally I'd like to have quadratic interpolation and all that but if the code is going to be slower and less accurate then what is the point? Surely I'm doing something wrong? I wasn't able to test this with LDC because it's giving me linker errors for some reason. I'm also curious as to what D is using internally in the first place. It should be the libc's. With regards to your code, it looks like your computation is wrong. I gave it a go (although refactored) and it just wouldn't output the right set of values. So performance isn't all that important right now.
Re: Performance of tables slower than built in?
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote: I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access. It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays.
Performance of tables slower than built in?
I am trying to create some fast sin, sinc, and exponential routines to speed up some code by using tables... but it seems it's slower than the function itself?!? The code below uses matplotlibd but it is not necessary. The timing code is lower down. Ideally I'd like to have quadratic interpolation and all that, but if the code is going to be slower and less accurate, then what is the point? Surely I'm doing something wrong? I wasn't able to test this with LDC because it's giving me linker errors for some reason. I'm also curious as to what D is using internally in the first place. The generation uses a quarter sine, since the waveform is symmetric. This does slow down the code a little, but I can't imagine the few ops it uses to do this are more costly than whatever is used for computing sin.

import std.math;
import std.range;
import std.random;
import std.algorithm;
import plt = matplotlibd.pyplot;
import std.stdio, core.stdc.stdlib;

__gshared double[] QuarterSinTab;

auto GenQuarterSinTab(int res = 4*100)
{
    res++; // Add extra endpoint to make linear interpolation easier/faster
    alias T = typeof(QuarterSinTab[0]);
    if (QuarterSinTab !is null) free(QuarterSinTab.ptr);
    QuarterSinTab = (cast(T*)malloc(T.sizeof*res))[0..res];
    res--;
    for(int i = 0; i < res; i++)
        QuarterSinTab[i] = sin(PI*(i/cast(double)res));
    QuarterSinTab[$-1] = QuarterSinTab[0];
}

/* Uses the tabularized quarter sine and extends it to the full range */
auto sinTab(string Method = "Linear")(typeof(QuarterSinTab[0]) x)
{
    alias T = typeof(QuarterSinTab[0]);
    auto len = QuarterSinTab.length - 1;
    auto s = -1;
    auto m = x - cast(int)x;
    auto m2 = 1;
    if (x < 0) { m = m + 1; s = 1; }
    if ((abs(x) % 2) < 1) s = -s;
    auto a = m*len;
    auto ai = cast(int)(m*len);
    switch(Method)
    {
        default:
        case "Linear":
            auto f = a - ai;
            return s*((1 - f)*QuarterSinTab[ai] + f*QuarterSinTab[ai+1]);
        case "Constant":
            return s*QuarterSinTab[ai];
    }
}

void main()
{
    GenQuarterSinTab;

    auto x = iota(-10, 10, 0.01).map!(x => x * PI);
    auto y = x.map!("sin(PI*a)");
    auto z = x.map!((x){ return sin(PI*x) - sinTab(x);});
    //plt.plot(x, y, "r-", ["label": "$y=sin(x)$"]);
    plt.plot(x, z, "b-", ["label": "$y$"]);
    plt.xlim(-5, 5);
    plt.ylim(-0.001, 0.001);
    plt.legend();
    plt.show();

    import std.datetime;
    double xxx = 0;
    StopWatch sw;
    sw.start();
    for(double i = 0; i < 1000; i++) xxx += sinTab(i);
    auto t = sw.peek().msecs;
    writeln(t);
    sw.stop();
    writeln(xxx);

    xxx = 0;
    sw.reset();
    sw.start();
    for(double i = 0; i < 1000; i++) xxx += sin(PI*i);
    t = sw.peek().msecs;
    writeln(t);
    sw.stop();
}