Re: [music-dsp] Efficiency of clear/copy/offset buffers
On 12/03/2013 5:58 AM, Nigel Redmon wrote:

> // round up to nearest power of two
> unsigned int v = theSize;
> v--;          // so we don't go up if already a power of 2
> v |= v >> 1;  // roll the highest bit into all lower bits...
> v |= v >> 2;
> v |= v >> 4;
> v |= v >> 8;
> v |= v >> 16;
> v++;          // and increment to power of 2

The "Hacker's Delight" book is a good source for this type of thing:
http://www.hackersdelight.org/

Ross.

--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp
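For reference, the snippet quoted above can be wrapped into a function roughly as follows. This is a minimal sketch, not code from the thread: the name `round_up_pow2`, the fixed 32-bit width, and the precondition are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to the nearest power of two (32-bit version).
   A value that is already a power of two is returned unchanged.
   Assumed precondition: 1 <= v <= 2^31, so the result fits in 32 bits. */
uint32_t round_up_pow2(uint32_t v)
{
    assert(v >= 1u && v <= 0x80000000u);
    v--;            /* so an exact power of two is not doubled */
    v |= v >> 1;    /* smear the highest set bit into all lower bits */
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v++;            /* all ones below the top bit, plus one */
    return v;
}
```

Without the assert, an input of 0 or anything above 2^31 comes back as 0, which is why the sketch rejects those values up front.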
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On 15/03/2013 7:27 AM, Sampo Syreeni wrote:

> Quite a number of processors have/used to have explicit support for
> counted for loops. Has anybody tried masking against doing the inner
> loop as a buffer-sized counted for and only worrying about the
> wrap-around in an outer, second loop, the way we do it with unaligned
> copies, SIMD and other forms of unrolling?

Yes. I usually do that when I can. I posted code earlier in the thread.

Doesn't work so well if your phase increment varies in non-simple ways (i.e. FM).

Ross.
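The code Ross mentions is not part of this message, so the following is only a sketch of the general idea under assumed names (`ring_read`, `ring`, `pos`): the inner loop is a plain counted loop over a contiguous run, and the wrap-around is tested once per run in the outer loop.

```c
#include <assert.h>
#include <stddef.h>

/* Copy n samples out of a circular buffer of length size, starting at
   *pos, and leave *pos at the next read position. The wrap test runs
   once per contiguous run rather than once per sample. */
void ring_read(const float *ring, size_t size, size_t *pos,
               float *out, size_t n)
{
    assert(size > 0 && *pos < size);
    while (n > 0) {
        size_t run = size - *pos;            /* samples left before the wrap */
        if (run > n)
            run = n;
        for (size_t i = 0; i < run; i++)     /* counted loop, no wrap test */
            out[i] = ring[*pos + i];
        out += run;
        n -= run;
        *pos += run;
        if (*pos == size)                    /* the only wrap check */
            *pos = 0;
    }
}
```

This works because the read position advances by exactly one per sample. With a varying phase increment (the FM case above) the length of each run is not known in advance, which is the limitation Ross points out.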
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On 15/03/2013 6:02 AM, jpff wrote:

> "Ross" == Ross Bencina writes:
>
> Ross> I am suspicious about whether the mask is faster than the conditional for
> Ross> a couple of reasons:
> Ross> - branch prediction works well if the branch usually falls one way
> Ross> - cmove (conditional move instructions) can avoid an explicit branch
> Ross> Once again, you would want to benchmark.
>
> I did the comparison for Csound a few months ago. The loss in using
> modulus over mask was more than I could contemplate my users accepting.
> We provide both versions for those who want non-power-of-2 tables and
> can take the considerable hit (gcc 4, x86_64)

Hi John,

I just want to clarify whether we're talking about the same thing.

You wrote:

John> The loss in using modulus over mask

Do you mean:

    x = x % 256;   // modulus
    x = x & 0xFF;  // mask

?

Because I wrote:

Ross> whether the mask is faster than the conditional

I.e.:

    x = x & 0xFF;           // mask
    if( x == 256 ) x = 0;   // conditional

Note that I am referring to the case where the instruction set has CMOV (on IA-32 it was added with the Pentium Pro, I think).

Ross.
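To keep the three variants under discussion apart, here they are side by side as small functions. This is an illustrative sketch with made-up names (`wrap_mask`, `wrap_cond`, `wrap_mod`); it makes no claim about which is faster, which, as noted above, needs benchmarking on the target.

```c
#include <assert.h>

/* Each function advances an index by one and wraps it to [0, size). */

/* Mask: size must be a power of two. */
unsigned wrap_mask(unsigned i, unsigned size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (i + 1) & (size - 1);
}

/* Conditional: any size. A compiler may turn this into a conditional
   move instead of a branch, but that is not guaranteed. */
unsigned wrap_cond(unsigned i, unsigned size)
{
    assert(i < size);
    i++;
    if (i == size)
        i = 0;
    return i;
}

/* Modulus: any size, but implies an integer division when size is not
   a compile-time constant. */
unsigned wrap_mod(unsigned i, unsigned size)
{
    assert(size != 0);
    return (i + 1) % size;
}
```

Note that the conditional form is only correct for a step of one (or, more generally, a step smaller than the buffer size with a `>=` test), whereas mask and modulus handle any step.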
Re: [music-dsp] Efficiency of clear/copy/offset buffers
RBJ's response would fit into that category I think Sampo (:

On Thu, Mar 14, 2013 at 1:27 PM, Sampo Syreeni wrote:
> On 2013-03-14, jpff wrote:
>
>> I did the comparison for Csound a few months ago. The loss in using
>> modulus over mask was more than I could contemplate my users accepting.
>
> Quite a number of processors have/used to have explicit support for counted
> for loops. Has anybody tried masking against doing the inner loop as a
> buffer-sized counted for and only worrying about the wrap-around in an
> outer, second loop, the way we do it with unaligned copies, SIMD and other
> forms of unrolling?
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On 2013-03-14, jpff wrote:

> I did the comparison for Csound a few months ago. The loss in using
> modulus over mask was more than I could contemplate my users accepting.

Quite a number of processors have/used to have explicit support for counted for loops. Has anybody tried masking against doing the inner loop as a buffer-sized counted for and only worrying about the wrap-around in an outer, second loop, the way we do it with unaligned copies, SIMD and other forms of unrolling?

--
Sampo Syreeni, aka decoy - de...@iki.fi, http://decoy.iki.fi/front
+358-50-5756111, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On 14/03/2013 19:02, jpff wrote:

> "Ross" == Ross Bencina writes:
>
> Ross> I am suspicious about whether the mask is faster than the conditional for
> Ross> a couple of reasons:
> Ross> - branch prediction works well if the branch usually falls one way
> Ross> - cmove (conditional move instructions) can avoid an explicit branch
> Ross> Once again, you would want to benchmark.
>
> I did the comparison for Csound a few months ago. The loss in using
> modulus over mask was more than I could contemplate my users accepting.
> We provide both versions for those who want non-power-of-2 tables and
> can take the considerable hit (gcc 4, x86_64)
>
> ==John ffitch

I've never used modulus in a ring buffer, mainly because I recoil in horror at every division I see! :)

If you have a lot of buffers, like in a reverb, I found it's best to use only the memory that's needed. When I replaced the ANDs with IFs, my reverb was much more efficient.

I guess it depends on your uses, but my slowdown was caused by memory cache issues with the higher-than-needed buffer sizes.

Dave.
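Dave's cache argument can be made concrete by comparing the memory touched when every delay is rounded up for AND-masking against the memory touched with exact lengths and an IF wrap. The sketch below is illustrative only; the function names are made up, and any delay lengths fed to it are hypothetical, not taken from Dave's reverb.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of two that is >= n (n must be at least 1). */
size_t next_pow2(size_t n)
{
    assert(n >= 1);
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Total samples needed when each delay gets exactly its own length. */
size_t total_exact(const size_t *lengths, size_t count)
{
    size_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += lengths[i];
    return sum;
}

/* Total samples needed when each delay is rounded up for AND-masking. */
size_t total_pow2(const size_t *lengths, size_t count)
{
    size_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += next_pow2(lengths[i]);
    return sum;
}
```

For three hypothetical delays of 1100, 1300 and 2100 samples, the exact total is 4500 samples while the rounded total is 2048 + 2048 + 4096 = 8192, so nearly half the masked footprint is padding that still competes for cache.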
Re: [music-dsp] Efficiency of clear/copy/offset buffers
I'm sure it varies from hardware to hardware too, so it's always good to know your options.

On Thu, Mar 14, 2013 at 12:02 PM, jpff wrote:
>> "Ross" == Ross Bencina writes:
>
> Ross> I am suspicious about whether the mask is faster than the conditional for
> Ross> a couple of reasons:
>
> Ross> - branch prediction works well if the branch usually falls one way
>
> Ross> - cmove (conditional move instructions) can avoid an explicit branch
>
> Ross> Once again, you would want to benchmark.
>
> I did the comparison for Csound a few months ago. The loss in using
> modulus over mask was more than I could contemplate my users
> accepting. We provide both versions for those who want non-power-of-2
> tables and can take the considerable hit (gcc 4, x86_64)
>
> ==John ffitch
Re: [music-dsp] Efficiency of clear/copy/offset buffers
> "Ross" == Ross Bencina writes:

Ross> I am suspicious about whether the mask is faster than the conditional for
Ross> a couple of reasons:
Ross> - branch prediction works well if the branch usually falls one way
Ross> - cmove (conditional move instructions) can avoid an explicit branch
Ross> Once again, you would want to benchmark.

I did the comparison for Csound a few months ago. The loss in using modulus over mask was more than I could contemplate my users accepting. We provide both versions for those who want non-power-of-2 tables and can take the considerable hit (gcc 4, x86_64)

==John ffitch
Re: [music-dsp] Efficiency of clear/copy/offset buffers
The most important reason to write clear & commented code is... your future self, anyway. You're pretty much a stranger to your own code when you look at it years later.

-----Original Message-----
From: robert bristow-johnson
Sent: Monday, March 11, 2013 9:38 PM
To: music-dsp@music.columbia.edu
Subject: Re: [music-dsp] Efficiency of clear/copy/offset buffers

On 3/11/13 4:25 PM, Theo Verelst wrote:

> A lot of the considerations of course have to do with trying to make
> maintainable, and therefore readable code.
> ...
> Of course fancy looking constructs are cool, probably in industry it is
> sometimes the only way to keep secrets from the competition,

ha-ha!

this was similar to what i was telling a certain director (you guys would certainly recognize his name) of a certain synthesizer R&D division in 2007 or 2008. making sure you don't hire the mole is how you protect proprietary code. uncommented spaghetti code is a very stupid way to protect secret code because even the good guys who are hired to develop the code further can't figure it out.

uncommented, poorly written spaghetti code has a negative productivity measure. you waste more time trying to figure it out and how you will have to modify it than just writing decent code to begin with.

--
r b-j r...@audioimagination.com
"Imagination is more important than knowledge."
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On 3/11/13 4:25 PM, Theo Verelst wrote:

> A lot of the considerations of course have to do with trying to make
> maintainable, and therefore readable code.
> ...
> Of course fancy looking constructs are cool, probably in industry it is
> sometimes the only way to keep secrets from the competition,

ha-ha!

this was similar to what i was telling a certain director (you guys would certainly recognize his name) of a certain synthesizer R&D division in 2007 or 2008. making sure you don't hire the mole is how you protect proprietary code. uncommented spaghetti code is a very stupid way to protect secret code because even the good guys who are hired to develop the code further can't figure it out.

uncommented, poorly written spaghetti code has a negative productivity measure. you waste more time trying to figure it out and how you will have to modify it than just writing decent code to begin with.

--
r b-j r...@audioimagination.com
"Imagination is more important than knowledge."
Re: [music-dsp] Efficiency of clear/copy/offset buffers
A lot of the considerations of course have to do with trying to make maintainable, and therefore readable code. That's good, but it often is not all too clear what DSP constructs have to do with enums or digital logic definitions, even though of course the P in DSP is done by digital logic.

Various software packages, from C++ to "low level" assembly type code, aren't very clear about what is being optimized here: code readability, programming efficiency, typing/coding efficiency, compiler, preprocessor, main compiler or linker efficiency (probably not a big factor in most DSP), code memory efficiency or code processing speed efficiency, and probably some more options (loadable module efficiency, dynamic linking, data structure sharing between processes, cache coherency management, etc. etc.)

Of course fancy looking constructs are cool, probably in industry it is sometimes the only way to keep secrets from the competition, but, to stay with the title of the thread: I suppose in many ways clearing and copying buffers is generally not so execution efficient in many cases, but it can be neat, or prevent buffer switching artifacts, and of course be needed to fill memory contents for DMA-type processing.

With respect to the code neatness and "programmer" efficiency, I doubt much of it appeals to me. Usually the mix of the various efficiencies tends to communicate "not enough programmer intelligence" to me more than most other things (like "I program therefore I exist"). However important this may all be, just like with certain kinds of "music", I don't take the original and the well-intended concepts unseriously, but I'd like progress instead of technology conservation, though of course there is a decent place for that in modern society.

I tend to work with complicated "blocks". My Jack/Ladspa combination blocks are hard to manage (I use scripts) but can give out music processing I like.

I am absolutely sure a lot of buffers on my I7 aren't used to the full extent of their potential at all, but hey, it is a lot of work to go down to the level of ants and make specific algorithms efficient. Also, when it isn't yet clear which algorithms and which code blocks are going to be important, I'd rather not humiliate myself with that.

Theo V.
Re: [music-dsp] Efficiency of clear/copy/offset buffers
actually a way to do it is to create these power-of-2 circular buffers at one level (like a higher "system" level) and allocate delay lines *within* an already-created circular buffer. you might need more than one circular buffer if you have, inside your algorithm or system of algorithms, different sampling rates (because the pointers move at different rates).

so you might have something like

    typedef struct
    {
        int *bufferBase;
        int sampleRate;
        unsigned long indexMask;      // initialized to bufferSize-1
        long lastAssignedIndex;       // initialized to bufferSize
    } circularBuffer;

    boolean initCircularBuffer(circularBuffer *thisCircularBuffer,
                               unsigned long bufferSize,
                               int thisSampleRate);
    // this malloc() the space, assigns the initial values, returns 0 if successful
    // bufferSize must be a power of 2

then, each time you need a delay line:

    typedef struct
    {
        circularBuffer *thisBuffer;
        unsigned long writeIndex;     // initted to lastAssignedIndex-maxDelayNeeded
        unsigned long delay;          // init to 0, i guess.
    } delayLine;

    boolean initDelayLine(circularBuffer *thisCircularBuffer,
                          delayLine *thisDelay,
                          unsigned long maxDelayNeeded);
    // this allocates space in the buffer at lastAssignedIndex-maxDelayNeeded
    // and saves that back to lastAssignedIndex (and it must be at least 0 if
    // no error). maxDelayNeeded need not be a power of 2.

sampleRate can be left out if the programmer chooses to keep track of which circular buffer is running at which sampling rate.

just an idea.

--
r b-j r...@audioimagination.com
"Imagination is more important than knowledge."
Re: [music-dsp] Efficiency of clear/copy/offset buffers
A way to do it at run time (the caller requests an arbitrary size, the routine enforces a power of 2):

    // round up to nearest power of two
    unsigned int v = theSize;
    v--;          // so we don't go up if already a power of 2
    v |= v >> 1;  // roll the highest bit into all lower bits...
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v++;          // and increment to power of 2

This is especially handy when the necessary buffer size depends on some user preferences, etc., and there are many possibilities.

I'm sure you could do something similar with template metaprogramming if it's something that is set at compile time, thereby not forcing the caller to jump through hoops, while still getting the most efficient implementation.
Re: [music-dsp] Efficiency of clear/copy/offset buffers
Interesting idea about rounding up and letting multiple buffers use the memory. Very nice.

I just wanted to add, on the front of enforcing power-of-2 sizes: the way you have it, where you pass in an integer and it understands that as a power of 2, is nice, but of course a little less intuitive to the user than saying "i want 1024 samples". They pass a 10 in, and whenever they see that number in the code, they have to spend time thinking or remembering what it means.

Another way that could be a nicety is to use an enum to get the best of both worlds:

    enum EBufferSizes
    {
        //... etc
        kBufferSize_512 = 9,
        kBufferSize_1024 = 10,
        //.. etc
    }

Sure, they could just put in ints instead of using your enum (some compilers might make warnings for that at least, or allow you to tell them to make warnings for that), but it could be a nice step toward making the interface a little nicer while still having the safety / ease of use in your example.
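A sketch of how such an enum might be consumed follows. Only the two enumerators shown in the message are used; the helper names `buffer_size_in_samples` and `buffer_index_mask` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Each enumerator's value is log2 of the buffer size it names. */
typedef enum {
    kBufferSize_512  = 9,
    kBufferSize_1024 = 10
} EBufferSizes;

/* Buffer length in samples for a given enumerator. */
size_t buffer_size_in_samples(EBufferSizes logSize)
{
    assert((unsigned)logSize < 31u);   /* keep the shift well defined */
    return (size_t)1 << logSize;
}

/* Index mask for AND-wrapping into a buffer of that size. */
size_t buffer_index_mask(EBufferSizes logSize)
{
    return buffer_size_in_samples(logSize) - 1;
}
```

The call site then reads `buffer_size_in_samples(kBufferSize_1024)`, which states the size in samples while still making a non-power-of-2 size impossible to request through the enum. In C an `int` converts to the enum type silently, so the protection is by convention unless the compiler is asked to warn.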
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On 3/11/13 11:19 AM, Phil Burk wrote:

> Regarding power-of-2 sized circular buffers, here is a handy way to
> verify that a bufferSize parameter is actually a power-of-2:
>
> int init_circular_buffer( int bufferSize, ... )
> {
>     assert( (bufferSize & (bufferSize-1)) == 0 )
>     ...

might be silly, but a way to force the caller to constrain it to a power of 2 is:

    init_circular_buffer( int logBufferSize, ... )
    {
        unsigned long bufferSize = 1L << logBufferSize;
        unsigned long indexMask = bufferSize - 1;
        ...

On 3/11/13 2:59 AM, Nigel Redmon wrote:

> Also a note that the modulo-by-AND indexing is built into some
> processors (the 56K family, at least, as Robert knows well): buffers are
> the next power of two higher than the space needed, and the masking
> happens for free...

actually the 56K and other DSPs (like the SHARC) can do buffers of any size below 32K. the 56K has a restriction that the base address of the buffer must be an integer multiple of a power of 2 that is at least as big as the bufferSize. the modulo arithmetic doesn't really happen for free. choosing to use a DSP over a cheap ARM chip or something similar has both advantages and disadvantages. and they have to put a bunch of logic on the chip for the modulo. even the 563xx chip has that 32K restriction, even though the address space increased to 16M. such a shame. you have minutes of addressing space, but your modulo delay lines are still limited to less than a second at any decent sampling rate.

but what you can do with C where you might have a bunch of different delay lines (like in a Schroeder/Jot reverb), all running at the same sampling rate, is create a *single* circular buffer that has length that is a power of 2. then each little delay line can have a piece of that buffer allocated, but all of the allocations move at the same rate. the various delay line allocations are "stationary" relative to each other.

it can be compared to an analog tape delay like this. you have a fixed amount of tape media but as many record and playback heads as your heart desires. so instead of cutting a separate loop of tape (which has to be of length equal to a power of two) and connecting that up to a record and playback head, you create one big loop of tape and put a record/playback head pair for each delay on the tape loop at different locations.

that way you can efficiently allocate a delay line of 129 or 257 or 4097 samples long along with a bunch of others. only the whole big buffer need be of length 2^p.

--
r b-j r...@audioimagination.com
"Imagination is more important than knowledge."
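The tape-loop picture can be sketched in C roughly as follows. Everything here (the names `TapeLoop`, `Tap`, `tap_alloc`, `tap_tick`, `loop_advance`, the fixed 1024-sample loop, the decrementing index) is an assumption chosen for illustration, not code from the thread. Each delay of `length` samples reserves `length + 1` slots so that a record head never overwrites a sample the neighbouring playback head still has to read.

```c
#include <assert.h>
#include <string.h>

#define LOOP_SIZE 1024u               /* the only power-of-two length */
#define LOOP_MASK (LOOP_SIZE - 1u)

typedef struct {
    float tape[LOOP_SIZE];            /* one shared loop of "tape" */
    unsigned now;                     /* moves one slot per sample */
    unsigned nextFree;                /* first unallocated offset */
} TapeLoop;

typedef struct {
    unsigned base;                    /* offset of the record head */
    unsigned length;                  /* delay in samples, any value */
} Tap;

void loop_init(TapeLoop *loop)
{
    memset(loop, 0, sizeof *loop);
}

/* Reserve a head pair for a delay of length samples; 0 on success. */
int tap_alloc(TapeLoop *loop, Tap *tap, unsigned length)
{
    assert(length >= 1u && length < LOOP_SIZE);
    if (length + 1u > LOOP_SIZE - loop->nextFree)
        return -1;                    /* not enough tape left */
    tap->base = loop->nextFree;
    tap->length = length;
    loop->nextFree += length + 1u;
    return 0;
}

/* Play back the sample recorded length ticks ago, then record in. */
float tap_tick(TapeLoop *loop, const Tap *tap, float in)
{
    unsigned w = (loop->now + tap->base) & LOOP_MASK;
    unsigned r = (loop->now + tap->base + tap->length) & LOOP_MASK;
    float out = loop->tape[r];
    loop->tape[w] = in;
    return out;
}

/* Call once per sample, after every tap has been ticked. */
void loop_advance(TapeLoop *loop)
{
    loop->now = (loop->now - 1u) & LOOP_MASK;
}
```

Only `now` is ever masked against the loop size; the individual delay lengths are unconstrained, which is the point of the scheme.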
Re: [music-dsp] Efficiency of clear/copy/offset buffers
Regarding power-of-2 sized circular buffers, here is a handy way to verify that a bufferSize parameter is actually a power-of-2:

    int init_circular_buffer( int bufferSize, ... )
    {
        assert( (bufferSize & (bufferSize-1)) == 0 );
        ...

Phil Burk
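One detail worth adding: the bare test `(n & (n-1)) == 0` also passes for `n == 0`, so a stricter check rejects zero explicitly. The sketch below uses an unsigned parameter and hypothetical names (`is_power_of_two`, `checked_index_mask`).

```c
#include <assert.h>
#include <stdbool.h>

/* True only for 1, 2, 4, 8, ...; zero is rejected explicitly. */
bool is_power_of_two(unsigned long n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Example use as an init-time guard: returns the AND mask for a valid
   size and aborts (in debug builds) on anything else. */
unsigned long checked_index_mask(unsigned long bufferSize)
{
    assert(is_power_of_two(bufferSize));
    return bufferSize - 1;
}
```

Because `assert` compiles away under `NDEBUG`, a release build that takes sizes from user input would need a real error path instead.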
Re: [music-dsp] Efficiency of clear/copy/offset buffers
>>On Mar 9, 2013, at 9:23 PM, robert bristow-johnson wrote: >> if it's a wavetable and you're doing linear interpolation, a simple >>trick is to copy x[0] to x[256] (make a 256 point wavetable a 257 >>element array) do the AND only for the first sample in the linear >>interpolation, the follow sample will always just follow and you need >>not AND the index for that second sample. >I used that technique for decades on packet-based queues also, adding >max-1 to the buffer; for instance, MIDI manager had a maximum packet >size of 256, so extending the queue by 255 let me copy any packet >without checking for wrap in my ancient MIDI stuff. Yeah, it has always been a bit of a hassle to get this right, whether for a communications processor, for a Unix or other processor computing a memory or paging segment index (an issue since the 70s), or, as was a hobby of mine in the early 80s, for a microprocessor cycling through an audio (sample) fragment gracefully and efficiently. Mind that the whole discussion should weigh the efficiency of a solution against its elegance: the code for a "modulo + bit-and" can be short and easy to debug on the one hand, and on the other hand it can execute efficiently, depending on the infrastructure of the processor that runs it. However, the memory architecture and memory access (in)efficiency are often more of a performance bottleneck than most other things, so unless you are accessing a fast I/O device the whole technique probably has only marginal relevance, and it probably doesn't prevent response-time jitter (which a conditional statement, and of course cache data dependencies and deviations in interrupt start timing caused by the "current instruction" issue, can introduce). 
I'm glad that the risk of nitpicking instead of finding universal truths about this computer design issue is limited these days, because micro-programmed hardware has returned, to the point where operations can be defined completely in random logic on an FPGA; a (Xilinx) Zynq device, for instance, allows an efficient DMA+processing unit to be combined with an ARM core. Of course "ease of programming", or the option of ripping some Open Source code, is also an issue, sometimes for the not honor-challenged. T.Verelst -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On Mar 9, 2013, at 9:23 PM, robert bristow-johnson wrote: > if it's a wavetable and you're doing linear interpolation, a simple trick is > to copy x[0] to x[256] (make a 256 point wavetable a 257 element array) do > the AND only for the first sample in the linear interpolation, the follow > sample will always just follow and you need not AND the index for that second > sample. I used that technique for decades on packet-based queues also, adding max-1 to the buffer; for instance, MIDI manager had a maximum packet size of 256, so extending the queue by 255 let me copy any packet without checking for wrap in my ancient MIDI stuff. Also a note that the modulo-by-AND indexing is built into some processors (the 56K family, at least, as Robert knows well); buffers are the next power of two higher than the space needed, and the masking happens for free... -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On 10/03/2013 05:23, robert bristow-johnson wrote: .. ANDing with 2^p - 1 is a well-known and oft-used technique in C. probably the best way to do it in C. ... if it's a wavetable and you're doing linear interpolation, a simple trick is to copy x[0] to x[256] (make a 256 point wavetable a 257 element array) do the AND only for the first sample in the linear interpolation, the follow sample will always just follow and you need not AND the index for that second sample. There are useful examples of this in the Csound codebase, specifically the "oscili" family of opcodes, based on reading a function table where the extra element at the end is called a "guard point". For various reasons, Csound was extended a while back to allow non power-of-two table sizes, but the original opcodes are preserved. I included a discussion (with some standalone code) of the Csound oscillator in the Audio Programming Book, in the context of the description of the C bitwise operators. The "oscili" opcode is also interesting for the way it handles the fractional part of the interpolation using integer operations in the manner of a fixed-point computation; all of the above conspiring to make the Csound oscillator famously fast. There are reasons enough to have conditional tests inside a per-sample loop, but where they can be avoided, significant speedups can be achieved. Richard Dobson -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
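The guard-point trick described above can be sketched in a few lines of C. This is not the Csound implementation (which, as noted, uses fixed-point phase arithmetic); it is a floating-point illustration, and the names set_guard_point and table_lookup_linear are hypothetical.

```c
#define TABLE_SIZE 256u
#define TABLE_MASK (TABLE_SIZE - 1u)

/* The table must hold TABLE_SIZE + 1 floats; the extra element is the
   guard point, a copy of the first sample. */
static void set_guard_point(float *table)
{
    table[TABLE_SIZE] = table[0];
}

/* Linear interpolation between table[i] and table[i + 1].
   ipart is the integer part of the phase, frac the fractional part in [0, 1). */
static float table_lookup_linear(const float *table, unsigned ipart, float frac)
{
    unsigned i = ipart & TABLE_MASK;  /* the only AND needed */
    float a = table[i];
    float b = table[i + 1u];          /* i + 1 <= TABLE_SIZE, so no second wrap */
    return a + frac * (b - a);
}
```

When the integer index is 255, the second sample is read from the guard point at index 256, which holds the same value as index 0.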
Re: [music-dsp] Efficiency of clear/copy/offset buffers
>>> Though recent gcc versions will replace the above "a/3.14" with a >>> multiplication, I remember a case where the denominator was constant >>> as well but not quite as explicitly stated, where gcc 4.x produced a >>> division instruction. >> >> not necessarily: in floating point math a/b and a * (1/b) do not yield >> the same result. therefore the compile should not optimize this, unless >> explicitly asked to do so (-freciprocal-math) > > I should have added, "when employing the usual suspects, -ffast-math > -O6 etc, as you usually would when compiling DSP code". Sorry! -ffast-math should not be used without knowing what it actually does, as it might break code in subtle ways. e.g. supercollider makes use of NaNs to represent an asynchronous `demand rate'. this breaks, if the compiler assumes that all floating point math is finite ... tim -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On 10/03/2013 3:51 PM, Alan Wolfe wrote: > index = index + 1; > if (index >= count) > index = 0; > Another, more compact way could be to do it this way: > index = (index + 1) % count; I am suspicious about whether the mask is faster than the conditional for a couple of reasons: - branch prediction works well if the branch usually falls one way - cmove (conditional move instructions) can avoid an explicit branch Once again, you would want to benchmark. > There's a neat technique to do this faster that I have to admit i got from Ross's code a few years ago in his audio library PortAudio. Probably I learnt that from Phil Burk. He is the master of bitmasks and circular buffers. There are also some other tricks in the PA code beyond what you mention here too... > That technique requires that your circular buffer is a power of 2, but so long as that is true, you can do an AND to get the remainder of the division. AND is super fast (even faster than the if / set) so it's a great improvement. How you do that looks like the below, assuming that your circular buffer is 1024 samples large: > index = ((index + 1) & 1023); // 1023 is just 1024-1 > if your buffer was 256 samples large it would look like this: > index = ((index + 1) & 255); // 255 is just 256 - 1 > Super useful trick so wanted to share it with ya (: My preferred technique is to avoid tests and masks in the inner loop by precomputing the loop length and hoisting the tests: int samplesToProcess = ? int i=0; while( samplesToProcess > 0 ){ int samplesToEndOfBuffer = bufferSize - index; int n = min( samplesToEndOfBuffer, samplesToProcess ); for( int j=0; j < n; ++j ){ output[i++] = buffer[ index++ ]; } if( index == bufferSize ) index = 0; // wrap index samplesToProcess -= n; } this way the index is only ever tested outside the inner loop (no masking or no conditionals in the inner loop). So long as the increment is significantly shorter than the buffer length you can make it work for non-integer increments too. 
You can do this with more than one test (hoisting multiple unrelated conditionals from the inner loop). Ross. On Sat, Mar 9, 2013 at 12:14 PM, Tim Goetze wrote: [Tim Blechmann] Though recent gcc versions will replace the above "a/3.14" with a multiplication, I remember a case where the denominator was constant as well but not quite as explicitly stated, where gcc 4.x produced a division instruction. not necessarily: in floating point math a/b and a * (1/b) do not yield the same result. therefore the compile should not optimize this, unless explicitly asked to do so (-freciprocal-math) I should have added, "when employing the usual suspects, -ffast-math -O6 etc, as you usually would when compiling DSP code". Sorry! Tim -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
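The hoisted-test loop from the message above, written out as a compilable function, might look like this. It is a sketch of the same technique rather than the code posted earlier in the thread; the function name ring_read and the use of size_t are my assumptions, and *index is assumed to be less than bufferSize on entry.

```c
#include <stddef.h>

/* Copy `count` samples from a circular buffer into `output`, starting
   at *index. The wrap test runs once per contiguous run, not once per
   sample, and no power-of-two buffer size is required. */
static void ring_read(const float *buffer, size_t bufferSize, size_t *index,
                      float *output, size_t count)
{
    size_t i = 0;
    while (count > 0) {
        size_t toEnd = bufferSize - *index;           /* samples before the wrap */
        size_t n = (count < toEnd) ? count : toEnd;   /* length of this run */
        for (size_t j = 0; j < n; ++j)
            output[i++] = buffer[(*index)++];         /* no test, no mask */
        if (*index == bufferSize)
            *index = 0;                               /* wrap once per run */
        count -= n;
    }
}
```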
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On 10/03/2013 7:01 AM, Tim Goetze wrote: [robert bristow-johnson] >On 3/9/13 1:31 PM, Wen Xue wrote: >>I think one can trust the compiler to handle a/3.14 as a multiplication. If it >>doesn't it'd probably be worse to write a*(1/3.14), for this would be a >>division AND a multiplication. > >there are some awful crappy compilers out there. even ones that start from gnu >and somehow become a product sold for use with some DSP. Though recent gcc versions will replace the above "a/3.14" with a multiplication, I remember a case where the denominator was constant as well but not quite as explicitly stated, where gcc 4.x produced a division instruction. I don't think this has anything to do with "crappy compilers". Unless multiplication by reciprocal gives exactly the same result -- with the same precision and the same rounding behavior and the same denormal behavior etc -- then it would be *incorrect* to automatically replace division by multiplication by reciprocal. So I think it's more a case of conformant compilers, not crappy compilers. I have always assumed that it is not (in general) valid for the compiler to automatically perform the replacement; and the only reason we can get away with it is because we make certain simplifying assumptions. Ross. -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On 3/9/13 11:51 PM, Alan Wolfe wrote: There's a neat technique to do this faster that I have to admit i got from Ross's code a few years ago in his audio library PortAudio. That technique requires that your circular buffer is a power of 2, but so long as that is true, you can do an AND to get the remainder of the division. AND is super fast (even faster than the if / set) so it's a great improvement. How you do that looks like the below, assuming that your circular buffer is 1024 samples large: index = ((index + 1)& 1023); // 1023 is just 1024-1 if your buffer was 256 samples large it would look like this: index = ((index + 1)& 255); // 255 is just 256 - 1 Super useful trick so wanted to share it with ya (: ANDing with 2^p - 1 is a well-known and oft-used technique in C. probably the best way to do it in C. to have to bit-wise AND by 0x00FF (or whatever) for every time a sample is moved into or fetched from the delay line can sometimes be burdensome. i have seen C code for a simple FIR filter that performs that AND only once per sample. there are two loops (one after the other) to sum in the taps. otherwise, i dunno how to ever avoid the bitwise AND. if it's a wavetable and you're doing linear interpolation, a simple trick is to copy x[0] to x[256] (make a 256 point wavetable a 257 element array) do the AND only for the first sample in the linear interpolation, the follow sample will always just follow and you need not AND the index for that second sample. you can extend the idea more for higher order interpolation like with a precision delay. let's say you have a 4096 sample delay buffer and your interpolation needs 16 samples for a good band-limited, sinc-like interpolation. then whatever samples that go into x[0] through x[15], must also be copied over to x[4096] through x[4111]. but your initial index would always be masked by (4096-1). 
it's just that when you tear through your interpolation, you need not worry about the indices of the 16 samples you would be using in your convolution. but worst case timing has it that you have to copy those 16 samples at the correct time. -- r b-j r...@audioimagination.com "Imagination is more important than knowledge." -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
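One way to keep the mirrored region up to date is to duplicate each sample at write time, as in the sketch below. This is an assumption on my part about how the copying could be scheduled (the message only says the copy must happen "at the correct time"); the names delay_write and delay_fir and the constant GUARD are hypothetical.

```c
#define DELAY_SIZE 4096u
#define DELAY_MASK (DELAY_SIZE - 1u)
#define GUARD      16u   /* number of taps used by the interpolator */

/* buf must hold DELAY_SIZE + GUARD floats. Samples written to slots
   0..GUARD-1 are also written to slots DELAY_SIZE..DELAY_SIZE+GUARD-1. */
static void delay_write(float *buf, unsigned *w, float x)
{
    unsigned i = *w & DELAY_MASK;
    buf[i] = x;
    if (i < GUARD)
        buf[DELAY_SIZE + i] = x;   /* keep the mirror in step */
    *w = (i + 1u) & DELAY_MASK;
}

/* Sum of products over GUARD contiguous samples. Only the starting
   index is masked; the inner loop can run past DELAY_SIZE safely. */
static float delay_fir(const float *buf, unsigned start, const float *h)
{
    const float *p = buf + (start & DELAY_MASK);
    float acc = 0.0f;
    for (unsigned k = 0; k < GUARD; ++k)
        acc += h[k] * p[k];
    return acc;
}
```

The conditional in delay_write runs once per written sample, while the per-tap loop in delay_fir has neither a mask nor a test.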
Re: [music-dsp] Efficiency of clear/copy/offset buffers
Hey while we are on the topic of efficiency and the OP not knowing that division was slower... Often times in DSP you'll use circular buffers (like for delay buffers for instance). Those are often implemented by having an array, and an index into the array for where the next sample should go. When you put a sample into the buffer, you increment the index and then make sure that if the index into the array is out of bounds, that it gets set back to zero so that it continually goes through the array in a circular fashion (thus the name circular buffer!). Incrementing the index could be implemented like this: index = index + 1; if (index >= count) index = 0; Another, more compact way could be to do it this way: index = (index + 1) % count; In that last one, it uses the modulo operator to get the remainder of a division to make sure the index is within range. The modulo operator has to pay the full cost of the divide though to figure out the remainder so it is the same cost as a division (talked about earlier!). There's a neat technique to do this faster that I have to admit i got from Ross's code a few years ago in his audio library PortAudio. That technique requires that your circular buffer is a power of 2, but so long as that is true, you can do an AND to get the remainder of the division. AND is super fast (even faster than the if / set) so it's a great improvement. 
How you do that looks like the below, assuming that your circular buffer is 1024 samples large: index = ((index + 1) & 1023); // 1023 is just 1024-1 if your buffer was 256 samples large it would look like this: index = ((index + 1) & 255); // 255 is just 256 - 1 Super useful trick so wanted to share it with ya (: On Sat, Mar 9, 2013 at 12:14 PM, Tim Goetze wrote: > [Tim Blechmann] >>> Though recent gcc versions will replace the above "a/3.14" with a >>> multiplication, I remember a case where the denominator was constant >>> as well but not quite as explicitly stated, where gcc 4.x produced a >>> division instruction. >> >>not necessarily: in floating point math a/b and a * (1/b) do not yield >>the same result. therefore the compile should not optimize this, unless >>explicitly asked to do so (-freciprocal-math) > > I should have added, "when employing the usual suspects, -ffast-math > -O6 etc, as you usually would when compiling DSP code". Sorry! > > Tim > -- > dupswapdrop -- the music-dsp mailing list and website: > subscription info, FAQ, source code archive, list archive, book reviews, dsp > links > http://music.columbia.edu/cmc/music-dsp > http://music.columbia.edu/mailman/listinfo/music-dsp -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
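For reference, the three ways of advancing a circular-buffer index mentioned above can be put side by side as small C functions. The function names are hypothetical; which variant is fastest depends on the compiler and processor, so it is worth benchmarking rather than assuming.

```c
/* Conditional wrap: works for any count > 0. */
static unsigned wrap_if(unsigned index, unsigned count)
{
    index = index + 1u;
    if (index >= count)
        index = 0u;
    return index;
}

/* Modulo wrap: works for any count > 0, but may cost a division. */
static unsigned wrap_mod(unsigned index, unsigned count)
{
    return (index + 1u) % count;
}

/* Mask wrap: only valid when count is a power of two. */
static unsigned wrap_mask(unsigned index, unsigned count)
{
    return (index + 1u) & (count - 1u);
}
```

For a power-of-two count all three give the same result for every index in range; for other counts only the first two are valid.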
Re: [music-dsp] Efficiency of clear/copy/offset buffers
[Tim Blechmann] >> Though recent gcc versions will replace the above "a/3.14" with a >> multiplication, I remember a case where the denominator was constant >> as well but not quite as explicitly stated, where gcc 4.x produced a >> division instruction. > >not necessarily: in floating point math a/b and a * (1/b) do not yield >the same result. therefore the compile should not optimize this, unless >explicitly asked to do so (-freciprocal-math) I should have added, "when employing the usual suspects, -ffast-math -O6 etc, as you usually would when compiling DSP code". Sorry! Tim -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
Re: [music-dsp] Efficiency of clear/copy/offset buffers
>>> I think one can trust the compiler to handle a/3.14 as a multiplication. If it >>> doesn't it'd probably be worse to write a*(1/3.14), for this would be a >>> division AND a multiplication. >> there are some awful crappy compilers out there. even ones that start from gnu >> and somehow become a product sold for use with some DSP. > Though recent gcc versions will replace the above "a/3.14" with a > multiplication, I remember a case where the denominator was constant > as well but not quite as explicitly stated, where gcc 4.x produced a > division instruction. not necessarily: in floating point math a/b and a * (1/b) do not always yield the same result. therefore the compiler should not optimize this, unless explicitly asked to do so (-freciprocal-math) tim -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
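The point about a/b versus a * (1/b) can be checked with a small experiment such as the sketch below, which counts how many inputs give different single-precision results. The function name count_mismatches is hypothetical, and the volatile qualifiers are only there to force each result to be rounded to float before comparison. When b is a power of two the reciprocal is exact and the two forms agree; for other divisors the results can differ in the last bit for some inputs.

```c
#include <stddef.h>

/* Count the inputs for which a[i] / b and a[i] * (1 / b) are not
   bit-for-bit equal in single precision. */
static size_t count_mismatches(const float *a, size_t n, float b)
{
    volatile float recip = 1.0f / b;   /* rounded once, as a cooked coefficient would be */
    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        volatile float q = a[i] / b;       /* one rounding */
        volatile float p = a[i] * recip;   /* two roundings in total */
        if (q != p)
            ++mismatches;
    }
    return mismatches;
}
```

Build it without -ffast-math or -freciprocal-math, otherwise the compiler is allowed to turn the division into the multiplication and the comparison becomes meaningless.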
Re: [music-dsp] Efficiency of clear/copy/offset buffers
[robert bristow-johnson] > On 3/9/13 1:31 PM, Wen Xue wrote: >> I think one can trust the compiler to handle a/3.14 as a multiplication. If >> it >> doesn't it'd probably be worse to write a*(1/3.14), for this would be a >> division AND a multiplication. > > there are some awful crappy compilers out there. even ones that start from > gnu > and somehow become a product sold for use with some DSP. Though recent gcc versions will replace the above "a/3.14" with a multiplication, I remember a case where the denominator was constant as well but not quite as explicitly stated, where gcc 4.x produced a division instruction. (I think the denominator was a c++ template parameter subjected to a binary shift operator but my memory of the exact circumstances is hazy. I do remember it was easy to evaluate at compile time, and my surprise at the compiler not getting it.) Tim -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
[music-dsp] Efficiency of clear/copy/offset buffers
> There is some discussion here: > http://www.kvraudio.com/forum/viewtopic.php?t=348751 Hey Ross, this was a good thread. So the picture I'm getting is this: 1) a gain curve that is linear in dB will produce an ideal profile where equal incremental movements along the curve produce equal changes in perceived volume. 2) you can never get this curve to hit zero amplitude, which occurs at dB=-infinity, so you have to patch in a linear segment at the bottom to deal with this 3) a quick-and-dirty solution that approximates the linear dB gain curve (and handles the zero amplitude case automatically) is a simple x^2 curve (in range 0.0 to 1.0) Does it sound like I have the right end of the stick? > When multiplying, you can do all the necessary multiplications in parallel > (think of performing a long multiply by hand, 1234 x 5678 for instance. > It's easy to imagine how you could speed this up by having a few friends help > you Nigel, thanks for this insight. Makes perfect sense. Regards, Stephen Clarke Managing Director ChordWizard Software Pty Ltd corpor...@chordwizard.com http://www.chordwizard.com ph: (+61) 2 4960 9520 fax: (+61) 2 4960 9580 -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
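Point 3 in the summary above, the quick-and-dirty x^2 curve, could be written as follows. The function name fader_to_gain and the clamping of the input to the 0.0 to 1.0 range are my assumptions.

```c
/* Map a fader position in [0, 1] to a linear gain using a square law.
   Position 0 gives exactly zero gain and position 1 gives unity gain. */
static float fader_to_gain(float position)
{
    if (position < 0.0f) position = 0.0f;   /* clamp out-of-range input */
    if (position > 1.0f) position = 1.0f;
    return position * position;
}
```

With this curve the half-way position gives a gain of 0.25, which is about -12 dB.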
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On Mar 8, 2013, at 10:55 PM, Ross Bencina wrote: > If your input is MIDI master volume you have to map from the MIDI value range > to linear gain (perhaps via decibels). Maybe there is a standard curve for > this? There may be standards for subsets, such as GM. Or perhaps even a more global standard nowadays. Dunno. I have many older MIDI hardware synthesizers, from years when collecting them was almost a sickness. The older ones didn't appear to follow any standard and in fact even similar models from the same manufacturer couldn't be expected to follow an identical "company standard" volume curve. Maybe there is more consistency nowadays? Tis been many years since buying a new hardware synth. -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On 3/9/13 1:31 PM, Wen Xue wrote: I think one can trust the compiler to handle a/3.14 as a multiplication. If it doesn't it'd probably be worse to write a*(1/3.14), for this would be a division AND a multiplication. there are some awful crappy compilers out there. even ones that start from gnu and somehow become a product sold for use with some DSP. i think this guy named Michael Kahl should be the compiler czar of the world. no compiler or development system is released anywhere in the world without his design and/or approval of it. -- r b-j r...@audioimagination.com "Imagination is more important than knowledge." -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
Re: [music-dsp] Efficiency of clear/copy/offset buffers
I think one can trust the compiler to handle a/3.14 as a multiplication. If it doesn't it'd probably be worse to write a*(1/3.14), for this would be a division AND a multiplication. -Original Message- From: Nigel Redmon Sent: Saturday, March 09, 2013 5:15 PM To: A discussion list for music-related DSP Subject: Re: [music-dsp] Efficiency of clear/copy/offset buffers On Mar 8, 2013, at 2:53 PM, ChordWizard Software wrote: But some are quite new - I never realised that multiplication ops were more efficient than divisions. Worthy of some background... When multiplying, you can do all the necessary multiplications in parallel (think of performing a long multiply by hand—1234 x 5678 for instance. It's easy to imagine how you could speed this up by having a few friends help you, where you manage the first digit, 4 x 5678, another handles 3(0) x 5678, etc., at the same time.) but when you divide, you need to finish one digit before you know what the remainder is and you can move to the next digit. There's no way to look ahead—you need the result of the first step before doing the second. So, processors optimize multiplication and addition with parallel circuits, but division is iterated in a microcode loop (or done entirely in software). The 56K DSPs, for instance have a single-cycle multiply, but for division, "DIV" is a single division iteration—you need to do it for every digit you need to generate. It's just the nature of the operation. Compilers may help you optimize constants, but it's always best to keep track of things yourself so you know what you're getting. So, yes, multiply by the sample period instead of dividing by the sample rate, etc. 
-- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On Mar 8, 2013, at 2:53 PM, ChordWizard Software wrote: > But some are quite new - I never realised that multiplication ops were more > efficient than divisions. Worthy of some background... When multiplying, you can do all the necessary multiplications in parallel (think of performing a long multiply by hand—1234 x 5678 for instance. It's easy to imagine how you could speed this up by having a few friends help you, where you manage the first digit, 4 x 5678, another handles 3(0) x 5678, etc., at the same time.) but when you divide, you need to finish one digit before you know what the remainder is and you can move to the next digit. There's no way to look ahead—you need the result of the first step before doing the second. So, processors optimize multiplication and addition with parallel circuits, but division is iterated in a microcode loop (or done entirely in software). The 56K DSPs, for instance have a single-cycle multiply, but for division, "DIV" is a single division iteration—you need to do it for every digit you need to generate. It's just the nature of the operation. Compilers may help you optimize constants, but it's always best to keep track of things yourself so you know what you're getting. So, yes, multiply by the sample period instead of dividing by the sample rate, etc. -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
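The advice to multiply by the sample period instead of dividing by the sample rate could look like this in a simple phase accumulator. The struct and function names (Phasor, phasor_init, phasor_tick) are hypothetical; the point is only that the one division happens at setup time and the per-sample code uses a multiply.

```c
typedef struct {
    double samplePeriod;  /* 1 / sampleRate, computed once at setup */
    double phase;         /* current phase in cycles, in [0, 1) */
} Phasor;

static void phasor_init(Phasor *p, double sampleRate)
{
    p->samplePeriod = 1.0 / sampleRate;   /* the only division */
    p->phase = 0.0;
}

/* Return the current phase, then advance it by freqHz * samplePeriod.
   Assumes 0 <= freqHz < sampleRate so that one subtraction is enough. */
static double phasor_tick(Phasor *p, double freqHz)
{
    double out = p->phase;
    p->phase += freqHz * p->samplePeriod;  /* multiply, not divide */
    if (p->phase >= 1.0)
        p->phase -= 1.0;
    return out;
}
```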
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On 9/03/2013 2:55 PM, Ross Bencina wrote: Note that audio faders are not linear in decibels either, e.g.: http://iub.edu/~emusic/etext/studio/studio_images/mixer9.jpg There is some discussion here: http://www.kvraudio.com/forum/viewtopic.php?t=348751 Ross. -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On 9/03/2013 9:53 AM, ChordWizard Software wrote: Maybe you can advise me on a related question - what's the best approach to implementing attenuation? I'm guessing it is not linear, since perceived sound loudness has a logarithmic profile - or am I confusing amplifier wattage with signal amplitude? What I do is use a linear scaling value internally -- that's the number that multiplies the signal. Let's call it linearGain. linearGain has the value 1.0 for unity gain and 0.0 for infinite attenuation. there is usually some mapping from "userGain" linearGain = f( userGain ); If userGain is expressed in decibels you can use the standard decibel to amplitude mapping: linearGain = 10 ^ (gainDb / 20.) If your input is MIDI master volume you have to map from the MIDI value range to linear gain (perhaps via decibels). Maybe there is a standard curve for this? Note that audio faders are not linear in decibels either, e.g.: http://iub.edu/~emusic/etext/studio/studio_images/mixer9.jpg Ross. -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On 3/8/13 5:53 PM, ChordWizard Software wrote: Ross, Alan, Robert, thanks for the comments. It's all good sense and very helpful as a reality check. I had considered some of these concepts already, it's good to get these validated (and expanded). But some are quite new - I never realised that multiplication ops were more efficient than divisions. always multiply by 1/C rather than divide by C. if C is a constant. it means your "coefficient cooking" code has to compute the reciprocal, but that is not something that need be done at sample time but in the code that gets executed when a knob is twisted. I totally agree with the idea that macro efficiency is a more rewarding starting point than micro. I generally do try to avoid copy routines that don't add any other value to the process at the same time. But it's a tradeoff, isn't it, between efficiency and trying to keep the code modular what i don't understand is why your modular code needs to make unnecessary copy operations. *every* instantiation of every module owns its own output buffers. and the inputs to every module are other modules' outputs (or the same module if you wanna do some delayed feedback). why and when do you need to copy? well, other than into a delay line buffer (like for FIR or multitap or reverb or similar). but that is an integral function of the module to begin with. with the system I/O i can surely imagine the need to copy out of the system buffer to some nice de-interleaved signal buffers. and if your system is floating point, it makes sense to me to convert from fixed (what comes from the A/D buffer) to float and detangle the left and right channel samples. and if there is a global input gain knob, to apply that gain on the samples as they are being passed from one buffer into another. that's a piece of system code, not part of a module that may or may not be instantiated. 
enough that you don't end up with some arcane multi-op tangle that has to get duplicated and tweaked for every special case. Anyway, if the general consensus is that memset and memcpy are reasonably efficient then that's my immediate need taken care of, as I'm trying very hard to stay cross-platform ready. Maybe you can advise me on a related question - what's the best approach to implementing attenuation? I'm guessing it is not linear, since perceived sound loudness has a logarithmic profile - or am I confusing amplifier wattage with signal amplitude? i've never understood "attenuation" being anything other than a gain coefficient with magnitude less than 1. inside your DSP engine, "amplitude" is just a number (but we often like to have the rails defined at -1 and +1), and when that signal goes out into an amplifier and loud speaker, there can be talk of "wattage" in an absolute sense. but inside your alg, only relative wattage makes any sense. at least to me (maybe i'm missing something, like an obscure standard). multiply your signal by a gain coefficient equal to 1/2 (or -1/2) and your voltage level (and r.m.s. voltage) in the amp drops to half, your wattage drops to 1/4 of the previous level and it's a -6.02 dB change. The design of my audio engine is to drive a default GM softsynth, are you coding the softsynth? or hooking up to someone else's? with optional overrides for each channel to use a VSTi or alternate synth/font instead. Sysex Master Volume support is by no means assured for all of these possible outputs, particularly the VSTs, so I'm realising that I probably need to implement my own master volume control at the output. well, your system output samples come from the output buffer that is owned by the module that is connected to the system output. 
at the end of your block processing time, after all of the modules got to process their input into their outputs, your system has a pointer to where the output blocks are and as you fetch those samples, you might have to interlace the samples from multiple channels, you might have to convert from float to fixed, and it appears to me that you might want to apply that Master Volume gain just before the float-to-fixed conversion. The obvious approach of course is linear scaling, but something tells me there might be a better way to balance the increments of perceived volume difference across the whole range? dunno what that is. a dB step issue? -- r b-j r...@audioimagination.com "Imagination is more important than knowledge." -- dupswapdrop -- the music-dsp mailing list and website: subscription info, FAQ, source code archive, list archive, book reviews, dsp links http://music.columbia.edu/cmc/music-dsp http://music.columbia.edu/mailman/listinfo/music-dsp
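A rough sketch of the output stage described above, assuming 16-bit interleaved output and one float buffer per channel. The name `writeOutput` and its parameters are made up for illustration and are not from any poster's code. The master gain is folded into the float-to-fixed scale once per block, so the volume control costs no extra multiply per sample. A linear gain of 0.5 corresponds to 20*log10(0.5), about -6.02 dB.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical output stage: reads one de-interleaved float buffer per
// channel, applies a master gain, and writes interleaved 16-bit samples.
void writeOutput(const std::vector<const float*>& channels, // one buffer per channel
                 std::size_t frames,                        // samples per channel
                 float masterGain,                          // linear gain, 0.5f is about -6.02 dB
                 std::int16_t* out)                         // frames * channels.size() samples
{
    assert(out != nullptr || frames == 0);
    const std::size_t numChannels = channels.size();
    // "Cook" the coefficient once per block: gain and fixed-point scale combined.
    const float scale = masterGain * 32768.0f;
    for (std::size_t ch = 0; ch < numChannels; ++ch) {
        const float* in = channels[ch];
        for (std::size_t i = 0; i < frames; ++i) {
            float s = in[i] * scale;
            // Clamp so that out-of-range input saturates instead of wrapping.
            s = std::min(32767.0f, std::max(-32768.0f, s));
            out[i * numChannels + ch] = static_cast<std::int16_t>(std::lrint(s));
        }
    }
}
```

Whether the channel-outer loop order shown here or a frame-outer order is faster depends on the cache behaviour of the target, so it is worth measuring both.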
[music-dsp] Efficiency of clear/copy/offset buffers
Ross, Alan, Robert, thanks for the comments. It's all good sense and very helpful as a reality check.

I had considered some of these concepts already, it's good to get these validated (and expanded). But some are quite new - I never realised that multiplication ops were more efficient than divisions.

I totally agree with the idea that macro efficiency is a more rewarding starting point than micro. I generally do try to avoid copy routines that don't add any other value to the process at the same time. But it's a tradeoff, isn't it, between efficiency and trying to keep the code modular enough that you don't end up with some arcane multi-op tangle that has to get duplicated and tweaked for every special case.

Anyway, if the general consensus is that memset and memcpy are reasonably efficient then that's my immediate need taken care of, as I'm trying very hard to stay cross-platform ready.

Maybe you can advise me on a related question - what's the best approach to implementing attenuation? I'm guessing it is not linear, since perceived sound loudness has a logarithmic profile - or am I confusing amplifier wattage with signal amplitude?

The design of my audio engine is to drive a default GM softsynth, with optional overrides for each channel to use a VSTi or alternate synth/font instead. Sysex Master Volume support is by no means assured for all of these possible outputs, particularly the VSTs, so I'm realising that I probably need to implement my own master volume control at the output.

The obvious approach of course is linear scaling, but something tells me there might be a better way to balance the increments of perceived volume difference across the whole range?
Regards, Stephen Clarke Managing Director ChordWizard Software Pty Ltd corpor...@chordwizard.com http://www.chordwizard.com ph: (+61) 2 4960 9520 fax: (+61) 2 4960 9580
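On the question of spacing volume steps evenly by ear: one common approach is to let the control work in decibels and convert to a linear gain coefficient only when the control moves. The sketch below is just that idea under stated assumptions. The names `dbToGain` and `sliderToGain`, the 0..1 slider range and the -60 dB floor are all invented for illustration and would need tuning by ear.

```cpp
#include <cassert>
#include <cmath>

// Converts decibels to a linear gain coefficient: gain = 10^(dB / 20).
inline float dbToGain(float dB)
{
    return std::pow(10.0f, dB / 20.0f);
}

// Hypothetical volume taper: maps a slider position in [0, 1] to a linear
// gain. Position 1 is unity gain (0 dB), position 0 is silence, and the
// positions in between are spaced evenly in dB over an assumed range of
// minDb..0 dB.
inline float sliderToGain(float position, float minDb = -60.0f)
{
    assert(position >= 0.0f && position <= 1.0f);
    if (position <= 0.0f) {
        return 0.0f; // hard mute at the bottom of the travel
    }
    return dbToGain(minDb * (1.0f - position));
}
```

The result is an ordinary linear coefficient, so the per-sample work is still a single multiply; only the mapping from control position to coefficient changes, and it runs when the control is moved rather than at sample time.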
Re: [music-dsp] Efficiency of clear/copy/offset buffers
On 3/7/13 10:11 PM, Alan Wolfe wrote:

> Quick 2 cents of my own to re-emphasize a point that Ross made - profile to find out which is fastest if you aren't sure (although it's good to ask too in case different systems have different oddities you don't know about)
>
> Also, if in the future you have performance issues, profile before acting for maximum efficiency... oftentimes what we suspect to be the bottleneck of our application is in fact not the bottleneck at all. Happens to everyone :P
>
> lastly, copying buffers is an important thing to get right, but in case you haven't heard this enough, when hitting performance problems it's often better to do MACRO optimization instead of MICRO optimization.
>
> Macro optimization means changing your algorithm, being smarter with the resources you have etc.
>
> Micro optimization means turning multiplications into bitshifts, breaking out the assembly and things like that.

one thing that makes sense to me, when i was worrying about this, was to try to do a few different tasks together in the same operation at a system level.

here's a case in point: in some previous product that will go unnamed because i don't want anyone pissed at me for "revealing state secrets", the product had multichannel in and multichannel out. the samples in the A/D and D/A DMA buffers were interlaced, fixed point, and scaled for the I/O device. but we wanted the different channel buffers to not be interlaced for the internal algs and we wanted the data to be converted to floating point (i don't like floating point so much, but the processor was float and the decision was made by bigger people than me that all the algs were to be floating point), and there were user-definable global gains going in and coming out of the box.

so i wrote (in assembly) a simple de-interlace, copy, scale, and convert-to-float of the samples going in, and the reverse of all of that for the samples going out. doing all four operations together cost about the same as just copying the data when done in assembly. maybe some setup overhead, but the sample was yanked from one buffer, converted to float, multiplied by the global gain, and stored into one of multiple other buffers. and going out was the reverse.

in between, the sorta user-defined algs were mono or multichannel, but looked at each channel as just another mono signal in a block or buffer that didn't have any confusing interleaving (no "stride" needed, unless it was a crude down-sampler and that was part of the alg definition, but the algs never had to think about skipping over other channels' samples).

--
r b-j r...@audioimagination.com
"Imagination is more important than knowledge."
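For illustration only, here is what that fused input pass might look like in portable C++ rather than assembly. The function name `readInput`, the 16-bit sample format and the buffer layout are assumptions, not details of the product described above. Each sample is read once, converted, scaled and stored into its own channel buffer, so the de-interleave, copy, gain and convert steps share a single trip through memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical input stage: de-interleaves 16-bit samples into one float
// buffer per channel, converting to float and applying an input gain in
// the same pass.
void readInput(const std::int16_t* in,              // interleaved, frames * channels.size() samples
               std::size_t frames,                  // samples per channel
               float inputGain,                     // linear gain
               const std::vector<float*>& channels) // one destination buffer per channel
{
    assert(in != nullptr || frames == 0);
    const std::size_t numChannels = channels.size();
    // Fold the fixed-to-float scale and the gain into one coefficient.
    const float scale = inputGain * (1.0f / 32768.0f);
    for (std::size_t i = 0; i < frames; ++i) {
        for (std::size_t ch = 0; ch < numChannels; ++ch) {
            channels[ch][i] = static_cast<float>(in[i * numChannels + ch]) * scale;
        }
    }
}
```

After this stage the processing modules can treat every channel as a plain mono block with no stride.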
Re: [music-dsp] Efficiency of clear/copy/offset buffers
Quick 2 cents of my own to re-emphasize a point that Ross made - profile to find out which is fastest if you aren't sure (although it's good to ask too in case different systems have different oddities you don't know about).

Also, if in the future you have performance issues, profile before acting for maximum efficiency... oftentimes what we suspect to be the bottleneck of our application is in fact not the bottleneck at all. Happens to everyone :P

Lastly, copying buffers is an important thing to get right, but in case you haven't heard this enough, when hitting performance problems it's often better to do MACRO optimization instead of MICRO optimization.

Macro optimization means changing your algorithm, being smarter with the resources you have etc.

Micro optimization means turning multiplications into bitshifts, breaking out the assembly and things like that.

Oftentimes macro optimizations will get you a bigger win (don't optimize a crappy sorting algorithm, just use a better sorting algorithm and it'll be way better) and will also result in more maintainable, portable code, so you should prefer going that route first.

Hope this helps!

On Thu, Mar 7, 2013 at 2:48 PM, Ross Bencina wrote:
> Stephen,
>
> On 8/03/2013 9:29 AM, ChordWizard Software wrote:
>> a) additive mixing of audio buffers
>> b) clearing to zero before additive processing
>
> You could also consider writing (rather than adding) the first signal to the buffer. That way you don't have to zero it first. It requires having a "write" and an "add" version of your generators. Depending on your code this may or may not be worth the trouble vs zeroing first.
>
> In the past I've sometimes used C++ templates to parameterise by the output operation (write/add) so you only have to write the code that generates the signals once.
>
>> c) copying from one buffer to another
>
> Of course you should avoid this wherever possible. Consider using (reference counted) buffer objects so you can share them instead of copying data. You could use reference counting, or just reclaim everything at the end of every cycle.
>
>> d) converting between short and float formats
>>
>> No surprises to any of you there I'm sure. My question is, can you give me a few pointers about making them as efficient as possible within that critical realtime loop?
>>
>> For example, how does the efficiency of memset, or ZeroMemory, compare to a simple for loop?
>
> Usually memset has a special case for writing zeros, so you shouldn't see too much difference between memset and ZeroMemory.
>
> memset vs simple loop will depend on your compiler.
>
> The usual wisdom is:
>
> 1) use memset vs writing your own. the library implementation will use SSE/whatever and will be fast. Of course this depends on the runtime
>
> 2) always profile and compare if you care.
>
>> Or using HeapAlloc with the HEAP_ZERO_MEMORY flag when the buffer is created (I know buffers shouldn't be allocated in a realtime callback, but just out of interest, I assume an initial zeroing must come at a cost compared to not using that flag)?
>
> It could happen in a few ways, but I'm not sure how it *does* happen on Windows and OS X.
>
> For example the MMU could map all the pages to a single zero page and then allocate+zero only when there is a write to the page.
>
>> I'm using Win32 but intend to port to OSX as well, so comments on the merits of cross-platform options like the C RTL would be particularly helpful. I realise some of those I mention above are Win-specific.
>>
>> Also for converting sample formats, are there more efficient options than simply using
>>
>>     nFloat = (float)nShort / 32768.0
>
> Unless you have a good reason not to you should prefer multiplication by reciprocal for the first one
>
>     const float scale = (float)(1. / 32768.0);
>     nFloat = (float)nShort * scale;
>
> You can do 4 at once if you use SSE/intrinsics.
>
>>     nShort = (short)(nFloat * 32768.0)
>
> Float => int conversion can be expensive depending on your compiler settings and supported processor architectures. There are various ways around this.
>
> Take a look at pa_converters.c and the pa_x86_plain_converters.c in PortAudio. But you can do better with SSE.
>
>> for every sample?
>>
>> Are there any articles on this type of optimisation that can give me some insight into what is happening behind the various memory management calls?
>
> Probably. I would make sure you allocate aligned memory, maybe lock it in physical memory, and then use it -- and generally avoid OS-level memory calls from then on.
>
> I would use memset() and memcpy(). These are optimised and the compiler may even inline an even more optimal version.
>
> The alternative is to go low-level and benchmark everything and write your own code in SSE (and learn how to optimise it).
>
> If you really care you need a good profiler.
>
> That's my 2c.
>
> HTH
>
> Ross.
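Since both posts come back to "profile and compare", here is a minimal timing harness as a starting point, assuming nothing beyond the C++11 standard library. The helper names (`clearWithMemset`, `clearWithLoop`, `averageNanoseconds`) are invented for this sketch. It is no substitute for a real profiler: an optimising compiler may turn the plain loop into a memset call, and cache state, buffer size and alignment all affect the numbers.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Clears with the C library. All-zero bytes read back as 0.0f on IEEE 754
// platforms, which is assumed here.
void clearWithMemset(float* buf, std::size_t n)
{
    std::memset(buf, 0, n * sizeof(float));
}

// Clears with a plain loop, for comparison.
void clearWithLoop(float* buf, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        buf[i] = 0.0f;
    }
}

// Calls f repeatedly and returns the average wall-clock time per call in
// nanoseconds, measured with a monotonic clock.
template <typename F>
double averageNanoseconds(F f, int repetitions)
{
    assert(repetitions > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repetitions; ++r) {
        f();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / repetitions;
}
```

A typical use would be to call `averageNanoseconds` with a lambda wrapping each variant, on buffers of the size the engine really uses, and compare the two averages over a few thousand repetitions in an optimised build.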
Re: [music-dsp] Efficiency of clear/copy/offset buffers
Stephen,

On 8/03/2013 9:29 AM, ChordWizard Software wrote:
> a) additive mixing of audio buffers
> b) clearing to zero before additive processing

You could also consider writing (rather than adding) the first signal to the buffer. That way you don't have to zero it first. It requires having a "write" and an "add" version of your generators. Depending on your code this may or may not be worth the trouble vs zeroing first.

In the past I've sometimes used C++ templates to parameterise by the output operation (write/add) so you only have to write the code that generates the signals once.

> c) copying from one buffer to another

Of course you should avoid this wherever possible. Consider using (reference counted) buffer objects so you can share them instead of copying data. You could use reference counting, or just reclaim everything at the end of every cycle.

> d) converting between short and float formats
>
> No surprises to any of you there I'm sure. My question is, can you give me a few pointers about making them as efficient as possible within that critical realtime loop?
>
> For example, how does the efficiency of memset, or ZeroMemory, compare to a simple for loop?

Usually memset has a special case for writing zeros, so you shouldn't see too much difference between memset and ZeroMemory.

memset vs simple loop will depend on your compiler.

The usual wisdom is:

1) use memset vs writing your own. the library implementation will use SSE/whatever and will be fast. Of course this depends on the runtime

2) always profile and compare if you care.

> Or using HeapAlloc with the HEAP_ZERO_MEMORY flag when the buffer is created (I know buffers shouldn't be allocated in a realtime callback, but just out of interest, I assume an initial zeroing must come at a cost compared to not using that flag)?

It could happen in a few ways, but I'm not sure how it *does* happen on Windows and OS X.

For example the MMU could map all the pages to a single zero page and then allocate+zero only when there is a write to the page.

> I'm using Win32 but intend to port to OSX as well, so comments on the merits of cross-platform options like the C RTL would be particularly helpful. I realise some of those I mention above are Win-specific.
>
> Also for converting sample formats, are there more efficient options than simply using
>
>     nFloat = (float)nShort / 32768.0

Unless you have a good reason not to you should prefer multiplication by reciprocal for the first one

    const float scale = (float)(1. / 32768.0);
    nFloat = (float)nShort * scale;

You can do 4 at once if you use SSE/intrinsics.

>     nShort = (short)(nFloat * 32768.0)

Float => int conversion can be expensive depending on your compiler settings and supported processor architectures. There are various ways around this.

Take a look at pa_converters.c and the pa_x86_plain_converters.c in PortAudio. But you can do better with SSE.

> for every sample?
>
> Are there any articles on this type of optimisation that can give me some insight into what is happening behind the various memory management calls?

Probably. I would make sure you allocate aligned memory, maybe lock it in physical memory, and then use it -- and generally avoid OS-level memory calls from then on.

I would use memset() and memcpy(). These are optimised and the compiler may even inline an even more optimal version.

The alternative is to go low-level and benchmark everything and write your own code in SSE (and learn how to optimise it).

If you really care you need a good profiler.

That's my 2c.

HTH

Ross.
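A minimal sketch of the write/add template idea mentioned above. It is one possible shape, not Ross's actual code: the policy names `WriteOp` and `AddOp` and the trivial constant-level "generator" are placeholders. The generator loop is written once, and the output operation is chosen at compile time, so the first generator into a buffer can use the write version and skip the separate zeroing pass.

```cpp
#include <cassert>
#include <cstddef>

// Output policies: how a generated sample is stored into the destination.
struct WriteOp {
    static void apply(float& dst, float value) { dst = value; }  // overwrite
};
struct AddOp {
    static void apply(float& dst, float value) { dst += value; } // mix in
};

// Hypothetical generator: a constant level stands in for a real oscillator.
// The signal-generating code exists once; OutputOp decides write vs add.
template <typename OutputOp>
void generateConstant(float* out, std::size_t n, float level)
{
    assert(out != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        OutputOp::apply(out[i], level);
    }
}
```

In use, the first source would be rendered with `generateConstant<WriteOp>` and every later source with `generateConstant<AddOp>`, so the buffer never needs clearing beforehand.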
[music-dsp] Efficiency of clear/copy/offset buffers
Greetings, and apologies in advance for bringing up what must be a well-covered topic on this list, I just couldn't find it in the archives anywhere.

I'm in the final stages of building an audio host/synth engine in C++, and of course a large part of its realtime workload is building and transferring audio buffers:

a) additive mixing of audio buffers
b) clearing to zero before additive processing
c) copying from one buffer to another
d) converting between short and float formats

No surprises to any of you there I'm sure. My question is, can you give me a few pointers about making them as efficient as possible within that critical realtime loop?

For example, how does the efficiency of memset, or ZeroMemory, compare to a simple for loop? Or using HeapAlloc with the HEAP_ZERO_MEMORY flag when the buffer is created (I know buffers shouldn't be allocated in a realtime callback, but just out of interest, I assume an initial zeroing must come at a cost compared to not using that flag)?

I'm using Win32 but intend to port to OSX as well, so comments on the merits of cross-platform options like the C RTL would be particularly helpful. I realise some of those I mention above are Win-specific.

Also for converting sample formats, are there more efficient options than simply using

    nFloat = (float)nShort / 32768.0
    nShort = (short)(nFloat * 32768.0)

for every sample?

Are there any articles on this type of optimisation that can give me some insight into what is happening behind the various memory management calls?

Regards, Stephen Clarke Managing Director ChordWizard Software Pty Ltd corpor...@chordwizard.com http://www.chordwizard.com ph: (+61) 2 4960 9520 fax: (+61) 2 4960 9580
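For reference, operations a) to c) from the list above can be written portably with the C runtime plus one short loop. This is only a baseline sketch to measure against, with made-up helper names (`clearBuffer`, `copyBuffer`, `mixBuffer`) and float buffers assumed. It does not claim to be the fastest form on any particular platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// b) Clear a buffer to silence. All-zero bytes equal 0.0f on IEEE 754
// platforms, which is assumed here.
inline void clearBuffer(float* dst, std::size_t n)
{
    std::memset(dst, 0, n * sizeof(float));
}

// c) Copy one buffer to another. The two ranges must not overlap.
inline void copyBuffer(float* dst, const float* src, std::size_t n)
{
    std::memcpy(dst, src, n * sizeof(float));
}

// a) Additive mixing: accumulate src into dst, sample by sample.
inline void mixBuffer(float* dst, const float* src, std::size_t n)
{
    assert((dst != nullptr && src != nullptr) || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] += src[i];
    }
}
```

Both `std::memset` and `std::memcpy` are standard C library calls, so the same code builds on Win32 and OS X without platform-specific replacements such as ZeroMemory.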