Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-14 Thread Ross Bencina

On 12/03/2013 5:58 AM, Nigel Redmon wrote:

 // round up to nearest power of two
 unsigned int v = theSize;
 v--;          // so we don't go up if already a power of 2
 v |= v >> 1;  // roll the highest bit into all lower bits...
 v |= v >> 2;
 v |= v >> 4;
 v |= v >> 8;
 v |= v >> 16;
 v++;          // and increment to power of 2


The "Hacker's Delight" book is a good source for this type of thing:

http://www.hackersdelight.org/

Ross.
--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-14 Thread Ross Bencina

On 15/03/2013 7:27 AM, Sampo Syreeni wrote:

Quite a number of processors have/used to have explicit support for
counted for loops. Has anybody tried masking against doing the inner
loop as a buffer-sized counted for and only worrying about the
wrap-around in an outer, second loop, the way we do it with unaligned
copies, SIMD and other forms of unrolling?


Yes. I usually do that when I can. I posted code earlier in the thread.

Doesn't work so well if your phase increment varies in non-simple ways 
(ie FM).
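
For readers without the earlier post at hand, here is a minimal sketch of the idea Sampo describes: the inner loop is a plain counted loop with no mask or branch, and the wrap-around is handled once per chunk in the outer loop. This is not the code Ross posted; the name ring_read and its signature are made up for illustration, and it assumes a read pointer that advances by exactly one sample at a time (which is why it does not help with FM-style phase increments).

```c
#include <assert.h>
#include <stddef.h>

/* Copy n samples out of a circular buffer of arbitrary length `size`,
   starting at read position *pos. Hypothetical helper for illustration. */
static void ring_read(const float *ring, size_t size, size_t *pos,
                      float *out, size_t n)
{
    assert(size > 0 && *pos < size);
    size_t p = *pos;
    while (n > 0) {
        size_t run = size - p;             /* samples left before the wrap */
        if (run > n)
            run = n;
        for (size_t i = 0; i < run; ++i)   /* counted inner loop, no wrap test */
            out[i] = ring[p + i];
        out += run;
        n -= run;
        p += run;
        if (p == size)
            p = 0;                         /* wrap once, outside the inner loop */
    }
    *pos = p;
}
```

The buffer length does not have to be a power of two here, since no mask is involved.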


Ross.


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-14 Thread Ross Bencina

On 15/03/2013 6:02 AM, jpff wrote:

"Ross" == Ross Bencina  writes:

  Ross> I am suspicious about whether the mask is faster than the conditional for
  Ross> a couple of reasons:
  Ross> - branch prediction works well if the branch usually falls one way
  Ross> - cmove (conditional move instructions) can avoid an explicit branch
  Ross> Once again, you would want to benchmark.

I did the comparison for Csound a few months ago. The loss in using
modulus over mask was more than I could contemplate my users
accepting.  We provide both versions for those who want non-power-of-2
tables and can take the considerable hit (gcc 4, x86_64)


Hi John,

I just want to clarify whether we're talking about the same thing:

You wrote:

John> The loss in using modulus over mask

Do you mean :

x = x % 256; // modulus
x = x & 0xFF; // mask

?

Because I wrote:

Ross> whether the mask is faster than the conditional

Ie:

x = x & 0xFF; // mask
if( x == 256 ) x = 0; // conditional


Note that I am referring to the case where the instruction set has CMOV 
(on IA-32 it was added with the Pentium Pro, I think).
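
To keep the three variants under discussion apart, here is a small sketch that puts them side by side. The function names are made up for illustration. Whether a compiler turns the conditional into a conditional move rather than a branch depends on the compiler, flags and target, so this says nothing about speed; as noted above, you would want to benchmark.

```c
#include <assert.h>

/* Modulus: works for any size, but may cost a division when size is
   not a compile-time constant. */
static unsigned wrap_mod(unsigned x, unsigned size)
{
    return x % size;
}

/* Mask: only valid when size is a power of two. */
static unsigned wrap_mask(unsigned x, unsigned size)
{
    return x & (size - 1u);
}

/* Conditional: works for any size, but only when x < 2 * size,
   e.g. an index that has just been advanced by less than size. */
static unsigned wrap_cond(unsigned x, unsigned size)
{
    return (x >= size) ? x - size : x;
}
```

For a power-of-two size and an index below twice the size, all three agree. Only the modulus and the conditional are valid for sizes such as 100.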


Ross.


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-14 Thread Alan Wolfe
RBJ's response would fit into that category I think Sampo (:

On Thu, Mar 14, 2013 at 1:27 PM, Sampo Syreeni  wrote:
> On 2013-03-14, jpff wrote:
>
>> I did the comparison for Csound a few months ago. The loss in using
>> modulus over mask was more than I could contemplate my users accepting.
>
>
> Quite a number of processors have/used to have explicit support for counted
> for loops. Has anybody tried masking against doing the inner loop as a
> buffer-sized counted for and only worrying about the wrap-around in an
> outer, second loop, the way we do it with unaligned copies, SIMD and other
> forms of unrolling?
> --
> Sampo Syreeni, aka decoy - de...@iki.fi, http://decoy.iki.fi/front
> +358-50-5756111, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
>


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-14 Thread Sampo Syreeni

On 2013-03-14, jpff wrote:

I did the comparison for Csound a few months ago. The loss in using 
modulus over mask was more than I could contemplate my users 
accepting.


Quite a number of processors have/used to have explicit support for 
counted for loops. Has anybody tried masking against doing the inner 
loop as a buffer-sized counted for and only worrying about the 
wrap-around in an outer, second loop, the way we do it with unaligned 
copies, SIMD and other forms of unrolling?

--
Sampo Syreeni, aka decoy - de...@iki.fi, http://decoy.iki.fi/front
+358-50-5756111, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-14 Thread David Hoskins


On 14/03/2013 19:02, jpff wrote:

"Ross" == Ross Bencina  writes:

  Ross> I am suspicious about whether the mask is faster than the conditional for
  Ross> a couple of reasons:

  Ross> - branch prediction works well if the branch usually falls one way

  Ross> - cmove (conditional move instructions) can avoid an explicit branch

  Ross> Once again, you would want to benchmark.

I did the comparison for Csound a few months ago. The loss in using
modulus over mask was more than I could contemplate my users
accepting.  We provide both versions for those who want non-power-of-2
tables and can take the considerable hit (gcc 4, x86_64)

==John ffitch


I've never used modulus in a ring buffer, mainly because I recoil in 
horror at every division I see! :)
If you have a lot of buffers, like in a reverb, I found it's best to 
only use the memory that's needed. When I replaced the ANDs with IFs my 
reverb was much more efficient.


I guess it depends on your uses, but my slowdown was caused by memory 
cache issues with the higher than needed buffer sizes.

Dave.








Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-14 Thread Alan Wolfe
I'm sure it varies from hardware to hardware too, so always good to
know your options

On Thu, Mar 14, 2013 at 12:02 PM, jpff  wrote:
>> "Ross" == Ross Bencina  writes:
>
>  Ross> I am suspicious about whether the mask is faster than the conditional for
>  Ross> a couple of reasons:
>
>  Ross> - branch prediction works well if the branch usually falls one way
>
>  Ross> - cmove (conditional move instructions) can avoid an explicit branch
>
>  Ross> Once again, you would want to benchmark.
>
> I did the comparison for Csound a few months ago. The loss in using
> modulus over mask was more than I could contemplate my users
> accepting.  We provide both versions for those who want non-power-of-2
> tables and can take the considerable hit (gcc 4, x86_64)
>
> ==John ffitch


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-14 Thread jpff
> "Ross" == Ross Bencina  writes:

 Ross> I am suspicious about whether the mask is faster than the conditional for 
 Ross> a couple of reasons:

 Ross> - branch prediction works well if the branch usually falls one way

 Ross> - cmove (conditional move instructions) can avoid an explicit branch

 Ross> Once again, you would want to benchmark.

I did the comparison for Csound a few months ago. The loss in using
modulus over mask was more than I could contemplate my users
accepting.  We provide both versions for those who want non-power-of-2
tables and can take the considerable hit (gcc 4, x86_64)

==John ffitch


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-11 Thread Didier Dambrin
The most important reason to write clear & commented code is... your 
future self, anyway.
You're pretty much a stranger to your own code when you look at it years 
later.





-----Original Message-----
From: robert bristow-johnson

Sent: Monday, March 11, 2013 9:38 PM
To: music-dsp@music.columbia.edu
Subject: Re: [music-dsp] Efficiency of clear/copy/offset buffers

On 3/11/13 4:25 PM, Theo Verelst wrote:
A lot of the considerations of course have to do with trying to make 
maintainable, and therefore readable code.


...

Of course fancy looking constructs are cool, probably in industry it is 
sometimes the only way to keep secrets from the competition,


ha-ha!  this was similar to what i was telling a certain director (you
guys would certainly recognize his name) of a certain synthesizer R&D
division in 2007 or 2008.

making sure you don't hire the mole is how you protect proprietary
code.  uncommented spaghetti code is a very stupid way to protect secret
code because even the good guys who are hired to develop the code
further can't figure it out.  uncommented, poorly written spaghetti code
has a negative productivity measure.  you waste more time trying to
figure it out and how you will have to modify it than just writing
decent code to begin with.

--

r b-j  r...@audioimagination.com

"Imagination is more important than knowledge."







Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-11 Thread robert bristow-johnson

On 3/11/13 4:25 PM, Theo Verelst wrote:
A lot of the considerations of course have to do with trying to make 
maintainable, and therefore readable code.


...

Of course fancy looking constructs are cool, probably in industry it 
is sometimes the only way to keep secrets from the competition,


ha-ha!  this was similar to what i was telling a certain director (you 
guys would certainly recognize his name) of a certain synthesizer R&D 
division in 2007 or 2008.


making sure you don't hire the mole is how you protect proprietary 
code.  uncommented spaghetti code is a very stupid way to protect secret 
code because even the good guys who are hired to develop the code 
further can't figure it out.  uncommented, poorly written spaghetti code 
has a negative productivity measure.  you waste more time trying to 
figure it out and how you will have to modify it than just writing 
decent code to begin with.


--

r b-j  r...@audioimagination.com

"Imagination is more important than knowledge."





Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-11 Thread Theo Verelst
A lot of the considerations of course have to do with trying to make 
maintainable, and therefore readable code. That's good, but it often is 
not all too clear what DSP constructs have to do with enums or digital 
logic definitions, even though of course the P in DSP is done by digital 
logic.


Various software packages, from C++ to "low level" assembly type code 
aren't very clear about what is being optimized here: code readability, 
programming efficiency, typing/coding efficiency, compiler preprocessor 
main compiler or linker efficiency (probably not a big factor in most 
DSP), code memory efficiency or code processing speed efficiency, and 
probably some more options (loadable module efficiency, dynamic linking, 
data structure sharing between processes, cache coherency management, 
etc etc.)


Of course fancy looking constructs are cool, probably in industry it is 
sometimes the only way to keep secrets from the competition, but, to 
stay with the title of the thread: I suppose in many ways clearing and 
copying buffers is generally not very execution-efficient in many 
cases, but can be neat, or prevent buffer switching artifacts, and of 
course be needed to fill memory contents for DMA-type processing.


With respect to the code neatness and "programmer" efficiency, I doubt 
much of it appeals to me. Usually the mix of the various efficiencies 
tends to communicate to me "not enough programmer intelligence" more 
than most other things (like "I program therefore I exist"). However 
important this may all be, just like certain "music" kinds I don't take 
the original and the well-intended concepts unseriously, but I'd like 
progress instead of technology conservation, though of course there is 
decent place for that in modern society. I tend to work with complicated 
"blocks". My Jack/Ladspa combination blocks are hard to manage (I use 
scripts) but can give out music processing I like. I am absolutely sure 
a lot of buffers on my i7 aren't used to the full extent of their 
potential at all, but hey, it is a lot of work to go down to the level 
of ants and make specific algorithms efficient. Also, when it isn't yet 
clear which algorithms and which code blocks are going to be important: 
I'll rather not humiliate myself with that.


Theo V.



Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-11 Thread robert bristow-johnson


actually a way to do it is to create these power-of-2 circular buffers 
at one level (like a higher "system" level) and allocate delay lines 
*within* an already-created circular buffer.  you might need more than 
one circular buffer if you have, inside your algorithm or system of 
algorithms, different sampling rates (because the pointers move at 
different rates).


so you might have something like

typedef struct {
int *bufferBase;
int sampleRate;
unsigned long indexMask;   // initialized to bufferSize-1
long lastAssignedIndex;  // initialized to bufferSize
} circularBuffer;

boolean initCircularBuffer(circularBuffer *thisCircularBuffer,
                           unsigned long bufferSize, int thisSampleRate);
   // this malloc()s the space, assigns the initial values, returns 0 if successful
   // bufferSize must be a power of 2


then, each time you need a delay line:

typedef struct {
circularBuffer *thisBuffer;
unsigned long writeIndex;   // initted to lastAssignedIndex-maxDelayNeeded
unsigned long delay;        // init to 0, i guess.
} delayLine;


boolean initDelayLine(circularBuffer *thisCircularBuffer,
                      delayLine *thisDelay, unsigned long maxDelayNeeded);
   // this allocates space in the buffer at lastAssignedIndex-maxDelayNeeded
   // and saves that back to lastAssignedIndex (and it must be at least 0 if no error).


maxDelayNeeded need not be a power of 2.   sampleRate can be left out if 
the programmer chooses to keep track of which circular buffer is running 
at which sampling rate.
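
A minimal sketch of how the two init routines above might be filled in, under some assumptions that are not in the post: sampleRate is left out (as the post allows), int is used in place of boolean, the buffer is zeroed with calloc(), every delay line writes first and then reads, and all write indices step down by one per sample. The tickDelayLine() routine is made up for illustration. With this write-then-read order, valid delays are 0 to maxDelayNeeded-1.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int *bufferBase;
    unsigned long indexMask;     /* bufferSize - 1 */
    long lastAssignedIndex;      /* starts at bufferSize, counts down */
} circularBuffer;

typedef struct {
    circularBuffer *thisBuffer;
    unsigned long writeIndex;
    unsigned long delay;
} delayLine;

/* Returns 0 if successful, nonzero otherwise. */
static int initCircularBuffer(circularBuffer *cb, unsigned long bufferSize)
{
    if (bufferSize == 0 || (bufferSize & (bufferSize - 1)) != 0)
        return 1;                            /* not a power of 2 */
    cb->bufferBase = calloc(bufferSize, sizeof *cb->bufferBase);
    if (cb->bufferBase == NULL)
        return 1;
    cb->indexMask = bufferSize - 1;
    cb->lastAssignedIndex = (long)bufferSize;
    return 0;
}

/* Carves maxDelayNeeded slots off the shared buffer; returns 0 if successful. */
static int initDelayLine(circularBuffer *cb, delayLine *dl,
                         unsigned long maxDelayNeeded)
{
    long start = cb->lastAssignedIndex - (long)maxDelayNeeded;
    if (maxDelayNeeded == 0 || start < 0)
        return 1;                            /* no room left in the buffer */
    cb->lastAssignedIndex = start;
    dl->thisBuffer = cb;
    dl->writeIndex = (unsigned long)start;
    dl->delay = 0;
    return 0;
}

/* Hypothetical per-sample routine: write the input, read the delayed
   sample, then move the write index back one slot (masked). */
static int tickDelayLine(delayLine *dl, int input)
{
    circularBuffer *cb = dl->thisBuffer;
    int output;
    cb->bufferBase[dl->writeIndex] = input;
    output = cb->bufferBase[(dl->writeIndex + dl->delay) & cb->indexMask];
    dl->writeIndex = (dl->writeIndex - 1) & cb->indexMask;
    return output;
}
```

Since every delay line in the buffer is ticked once per sample, the allocations keep their distance from each other and never overwrite a neighbour's samples before they have been read.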


just an idea.

--

r b-j  r...@audioimagination.com

"Imagination is more important than knowledge."





On 3/11/13 2:58 PM, Nigel Redmon wrote:

A way to do it at run time (the caller requests an arbitrary size, the routine 
enforces a power of 2):

 // round up to nearest power of two
 unsigned int v = theSize;
 v--;          // so we don't go up if already a power of 2
 v |= v >> 1;  // roll the highest bit into all lower bits...
 v |= v >> 2;
 v |= v >> 4;
 v |= v >> 8;
 v |= v >> 16;
 v++;          // and increment to power of 2

This is especially handy when the necessary buffer size depends on some user 
preferences, etc., and there are many possibilities.

I'm sure you could do something similar with template metaprogramming if it's 
something that is set at compile time, thereby not forcing the caller to jump 
through hoops, while still getting the most efficient implementation.


On Mar 11, 2013, at 11:24 AM, Alan Wolfe  wrote:

interesting idea about rounding up and letting multiple buffers using
the memory.  Very nice.

I just wanted to add on the front of enforcing powers of 2 sizes, the
way you have it where you pass in an integer and it understand that as
a power of 2 is nice but of course a little less intuitive to the user
than saying "i want 1024 samples".  They pass a 10 in and whenever
they see that number in the code, they have to spend time thinking or
remembering what it means.

another way that could be a nicety could be to use an enum to get the
best of both worlds

enum EBufferSizes
{
  //... etc
  kBufferSize_512 = 9,
  kBufferSize_1024 = 10,
  //.. etc
}

Sure they could just put in ints instead of using your enum (some
compilers might make warnings for that at least, or allow you to tell
them to make warnings for that), but it could be a nice step to making
the interface a little nicer while still having the safety / ease of
use in your example.

On Mon, Mar 11, 2013 at 11:06 AM, robert bristow-johnson
  wrote:

On 3/11/13 11:19 AM, Phil Burk wrote:

Regarding power-of-2 sized circular buffers, here is a handy way to
verify that a bufferSize parameter is actually a power-of-2:

int init_circular_buffer( int bufferSize, ... )
{
assert( (bufferSize & (bufferSize-1)) == 0 );
...



might be silly, but a way to force the caller to constrain it to a power of
2 is:

init_circular_buffer( int logBufferSize, ... )
{
unsigned long bufferSize = 1L << logBufferSize;
unsigned long indexMask = bufferSize - 1;
...



On 3/11/13 2:59 AM, Nigel Redmon wrote:

Also a note that the modulo-by-AND indexing is built into some
processors—the 56K family, at least, as Robert knows well…buffers are the
next power of two higher than the space needed, and the masking happens for
free…


actually the 56K and other DSPs (like the SHArC) can do buffers of any size
below 32K. the 56K has a restriction that the base address of the buffer
must be an integer multiple of a power of 2 that is at least as big as the
bufferSize. the modulo arithmetic doesn't really happen for free. choosing
to use a DSP over a cheap ARM chip or something similar has both advantages
and disadvantages. and they have to put a bunch of logic on the chip for the
modulo. even the 563xx chip has that 32K restriction, even though the
address space increased to 16M. such a shame. you have minutes of addressing
space, but your modulo delay lines are still limited to less than a second
a

Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-11 Thread Nigel Redmon
A way to do it at run time (the caller requests an arbitrary size, the routine 
enforces a power of 2):

// round up to nearest power of two
unsigned int v = theSize;
v--;          // so we don't go up if already a power of 2
v |= v >> 1;  // roll the highest bit into all lower bits...
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
v++;          // and increment to power of 2

This is especially handy when the necessary buffer size depends on some user 
preferences, etc., and there are many possibilities.

I'm sure you could do something similar with template metaprogramming if it's 
something that is set at compile time, thereby not forcing the caller to jump 
through hoops, while still getting the most efficient implementation.
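
For reference, the same bit trick wrapped in a function, as a sketch that assumes a 32-bit value. The name round_up_pow2 is made up for illustration. Note the edge cases: an input of 0, or anything above 2^31, comes back as 0, so a caller would want to reject those.

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to the next power of two; a power of two is returned unchanged.
   Returns 0 for v == 0 and for v > 2^31 (the result does not fit). */
static uint32_t round_up_pow2(uint32_t v)
{
    v--;            /* so we don't go up if already a power of 2 */
    v |= v >> 1;    /* roll the highest bit into all lower bits... */
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v++;            /* and increment to power of 2 */
    return v;
}
```

A request for 1000 samples would come back as 1024, and the index mask is then the result minus one.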


On Mar 11, 2013, at 11:24 AM, Alan Wolfe  wrote:
> interesting idea about rounding up and letting multiple buffers using
> the memory.  Very nice.
> 
> I just wanted to add on the front of enforcing powers of 2 sizes, the
> way you have it where you pass in an integer and it understand that as
> a power of 2 is nice but of course a little less intuitive to the user
> than saying "i want 1024 samples".  They pass a 10 in and whenever
> they see that number in the code, they have to spend time thinking or
> remembering what it means.
> 
> another way that could be a nicety could be to use an enum to get the
> best of both worlds
> 
> enum EBufferSizes
> {
>  //... etc
>  kBufferSize_512 = 9,
>  kBufferSize_1024 = 10,
>  //.. etc
> }
> 
> Sure they could just put in ints instead of using your enum (some
> compilers might make warnings for that at least, or allow you to tell
> them to make warnings for that), but it could be a nice step to making
> the interface a little nicer while still having the safety / ease of
> use in your example.
> 
> On Mon, Mar 11, 2013 at 11:06 AM, robert bristow-johnson
>  wrote:
>> On 3/11/13 11:19 AM, Phil Burk wrote:
>>> 
>>> Regarding power-of-2 sized circular buffers, here is a handy way to
>>> verify that a bufferSize parameter is actually a power-of-2:
>>> 
>>> int init_circular_buffer( int bufferSize, ... )
>>> {
>>> assert( (bufferSize & (bufferSize-1)) == 0 )
>>> ...
>>> 
>> 
>> 
>> might be silly, but a way to force the caller to constrain it to a power of
>> 2 is:
>> 
>> init_circular_buffer( int logBufferSize, ... )
>> {
>> unsigned long bufferSize = 1L << logBufferSize;
>> unsigned long indexMask = bufferSize - 1;
>> ...
>> 
>> 
>> 
>> On 3/11/13 2:59 AM, Nigel Redmon wrote:
>>> 
>>> Also a note that the modulo-by-AND indexing is built into some
>>> processors—the 56K family, at least, as Robert knows well…buffers are the
>>> next power of two higher than the space needed, and the masking happens for
>>> free…
>> 
>> 
>> actually the 56K and other DSPs (like the SHArC) can do buffers of any size
>> below 32K. the 56K has a restriction that the base address of the buffer
>> must be an integer multiple of a power of 2 that is at least as big as the
>> bufferSize. the modulo arithmetic doesn't really happen for free. choosing
>> to use a DSP over a cheap ARM chip or something similar has both advantages
>> and disadvantages. and they have to put a bunch of logic on the chip for the
>> modulo. even the 563xx chip has that 32K restriction, even though the
>> address space increased to 16M. such a shame. you have minutes of addressing
>> space, but your modulo delay lines are still limited to less than a second
>> at any decent sampling rate.
>> 
>> but what you can do with C where you might have a bunch of different delay
>> lines (like in a Schroeder/Jot reverb), all running at the same sampling
>> rate, is create a *single* circular buffer that has length that is a power
>> of 2. then each little delay line can have a piece of that buffer allocated,
>> but all of the allocations move at the same rate. the various delay line
>> allocations are "stationary" relative to each other.
>> 
>> it can be compared to an analog tape delay like this. you have a fixed
>> amount of tape media but as many record and playback heads as your heart
>> desires. so instead of cutting a separate loop of tape (which has to be of
>> length equal to a power of two) and connect that up to a record and playback
>> head, you create one big loop of tape and put a record/playback head pair
>> for each delay on the tape loop at different locations.
>> 
>> that way you can efficiently allocate a delay line of 129 or 257 or 4097
>> samples long along with a bunch of others. only the whole big buffer need be
>> of length 2^p .
>> 
>> 
>> --
>> 
>> r b-j  r...@audioimagination.com
>> 
>> "Imagination is more important than knowledge."
>> 
>> 
>> 

Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-11 Thread Alan Wolfe
interesting idea about rounding up and letting multiple buffers use
the memory.  Very nice.

I just wanted to add on the front of enforcing powers of 2 sizes, the
way you have it where you pass in an integer and it understands that as
a power of 2 is nice but of course a little less intuitive to the user
than saying "i want 1024 samples".  They pass a 10 in and whenever
they see that number in the code, they have to spend time thinking or
remembering what it means.

another way that could be a nicety could be to use an enum to get the
best of both worlds

enum EBufferSizes
{
  //... etc
  kBufferSize_512 = 9,
  kBufferSize_1024 = 10,
  //.. etc
};

Sure they could just put in ints instead of using your enum (some
compilers might make warnings for that at least, or allow you to tell
them to make warnings for that), but it could be a nice step to making
the interface a little nicer while still having the safety / ease of
use in your example.
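
A small sketch of how such an enum could feed the log-size init shown in the quoted message below. The kBufferSize_2048 entry and both helper functions are made up for illustration; only the two enum values from the message are taken as given.

```c
#include <assert.h>

typedef enum {
    kBufferSize_512  = 9,
    kBufferSize_1024 = 10,
    kBufferSize_2048 = 11    /* hypothetical extra entry */
} EBufferSizes;

/* The enum value is the log2 of the size, so the size is always a power of 2. */
static unsigned long bufferSizeInSamples(EBufferSizes logBufferSize)
{
    return 1UL << logBufferSize;
}

static unsigned long bufferIndexMask(EBufferSizes logBufferSize)
{
    return bufferSizeInSamples(logBufferSize) - 1UL;
}
```

The caller writes kBufferSize_1024 and reads it as "1024 samples", while the routine still only ever sees an exponent.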

On Mon, Mar 11, 2013 at 11:06 AM, robert bristow-johnson
 wrote:
> On 3/11/13 11:19 AM, Phil Burk wrote:
>>
>> Regarding power-of-2 sized circular buffers, here is a handy way to
>> verify that a bufferSize parameter is actually a power-of-2:
>>
>> int init_circular_buffer( int bufferSize, ... )
>> {
>> assert( (bufferSize & (bufferSize-1)) == 0 )
>> ...
>>
>
>
> might be silly, but a way to force the caller to constrain it to a power of
> 2 is:
>
> init_circular_buffer( int logBufferSize, ... )
> {
> unsigned long bufferSize = 1L << logBufferSize;
> unsigned long indexMask = bufferSize - 1;
> ...
>
>
>
> On 3/11/13 2:59 AM, Nigel Redmon wrote:
>>
>> Also a note that the modulo-by-AND indexing is built into some
>> processors—the 56K family, at least, as Robert knows well…buffers are the
>> next power of two higher than the space needed, and the masking happens for
>> free…
>
>
> actually the 56K and other DSPs (like the SHArC) can do buffers of any size
> below 32K. the 56K has a restriction that the base address of the buffer
> must be an integer multiple of a power of 2 that is at least as big as the
> bufferSize. the modulo arithmetic doesn't really happen for free. choosing
> to use a DSP over a cheap ARM chip or something similar has both advantages
> and disadvantages. and they have to put a bunch of logic on the chip for the
> modulo. even the 563xx chip has that 32K restriction, even though the
> address space increased to 16M. such a shame. you have minutes of addressing
> space, but your modulo delay lines are still limited to less than a second
> at any decent sampling rate.
>
> but what you can do with C where you might have a bunch of different delay
> lines (like in a Schroeder/Jot reverb), all running at the same sampling
> rate, is create a *single* circular buffer that has length that is a power
> of 2. then each little delay line can have a piece of that buffer allocated,
> but all of the allocations move at the same rate. the various delay line
> allocations are "stationary" relative to each other.
>
> it can be compared to an analog tape delay like this. you have a fixed
> amount of tape media but as many record and playback heads as your heart
> desires. so instead of cutting a separate loop of tape (which has to be of
> length equal to a power of two) and connect that up to a record and playback
> head, you create one big loop of tape and put a record/playback head pair
> for each delay on the tape loop at different locations.
>
> that way you can efficiently allocate a delay line of 129 or 257 or 4097
> samples long along with a bunch of others. only the whole big buffer need be
> of length 2^p .
>
>
> --
>
> r b-j  r...@audioimagination.com
>
> "Imagination is more important than knowledge."
>
>
>


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-11 Thread robert bristow-johnson

On 3/11/13 11:19 AM, Phil Burk wrote:

Regarding power-of-2 sized circular buffers, here is a handy way to
verify that a bufferSize parameter is actually a power-of-2:

int init_circular_buffer( int bufferSize, ... )
{
assert( (bufferSize & (bufferSize-1)) == 0 );
...




might be silly, but a way to force the caller to constrain it to a power 
of 2 is:


init_circular_buffer( int logBufferSize, ... )
{
unsigned long bufferSize = 1L << logBufferSize;
unsigned long indexMask = bufferSize - 1;
...


On 3/11/13 2:59 AM, Nigel Redmon wrote:

Also a note that the modulo-by-AND indexing is built into some processors—the 
56K family, at least, as Robert knows well…buffers are the next power of two 
higher than the space needed, and the masking happens for free…


actually the 56K and other DSPs (like the SHArC) can do buffers of any 
size below 32K. the 56K has a restriction that the base address of the 
buffer must be an integer multiple of a power of 2 that is at least as 
big as the bufferSize. the modulo arithmetic doesn't really happen for 
free. choosing to use a DSP over a cheap ARM chip or something similar 
has both advantages and disadvantages. and they have to put a bunch of 
logic on the chip for the modulo. even the 563xx chip has that 32K 
restriction, even though the address space increased to 16M. such a 
shame. you have minutes of addressing space, but your modulo delay lines 
are still limited to less than a second at any decent sampling rate.


but what you can do with C where you might have a bunch of different 
delay lines (like in a Schroeder/Jot reverb), all running at the same 
sampling rate, is create a *single* circular buffer that has length that 
is a power of 2. then each little delay line can have a piece of that 
buffer allocated, but all of the allocations move at the same rate. the 
various delay line allocations are "stationary" relative to each other.


it can be compared to an analog tape delay like this. you have a fixed 
amount of tape media but as many record and playback heads as your heart 
desires. so instead of cutting a separate loop of tape (which has to be 
of length equal to a power of two) and connect that up to a record and 
playback head, you create one big loop of tape and put a record/playback 
head pair for each delay on the tape loop at different locations.


that way you can efficiently allocate a delay line of 129 or 257 or 4097 
samples long along with a bunch of others. only the whole big buffer 
need be of length 2^p .
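
One possible reading of the shared-buffer scheme, as a sketch only. All names (`DelayPool`, `pool_alloc`, and so on) and the 8192-sample pool size are assumptions for illustration. A single position moves through the power-of-2 buffer once per sample, and each delay line is just a fixed offset and length relative to that position, like a record/playback head pair on the one tape loop.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 8192u              /* assumed size; must be a power of 2 */
#define POOL_MASK (POOL_SIZE - 1u)

typedef struct {
    float  data[POOL_SIZE];
    size_t pos;        /* shared position, stepped once per sample */
    size_t allocated;  /* slots handed out so far */
} DelayPool;

typedef struct {
    size_t base;    /* offset of the write slot relative to pos */
    size_t length;  /* delay in samples; need not be a power of 2 */
} DelayLine;

static void pool_init(DelayPool *p)
{
    for (size_t i = 0; i < POOL_SIZE; ++i) p->data[i] = 0.0f;
    p->pos = 0;
    p->allocated = 0;
}

/* Each line takes length + 1 slots so its read and write slots never overlap
   a neighbouring line. */
static DelayLine pool_alloc(DelayPool *p, size_t length)
{
    DelayLine d;
    assert(p->allocated + length + 1 <= POOL_SIZE);
    d.base = p->allocated;
    d.length = length;
    p->allocated += length + 1;
    return d;
}

/* Sample written `length` ticks ago. */
static float pool_delay_read(const DelayPool *p, const DelayLine *d)
{
    return p->data[(p->pos + d->base + d->length) & POOL_MASK];
}

static void pool_delay_write(DelayPool *p, const DelayLine *d, float x)
{
    p->data[(p->pos + d->base) & POOL_MASK] = x;
}

/* Call once per sample, after all reads and writes for that sample. */
static void pool_tick(DelayPool *p)
{
    p->pos = (p->pos - 1) & POOL_MASK;   /* unsigned wrap, then mask */
}
```

With this layout, lines of 129, 257 and 4097 samples fit together in the one 8192-sample buffer, and only the pool itself has to be of length 2^p.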


--

r b-j  r...@audioimagination.com

"Imagination is more important than knowledge."



--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-11 Thread Phil Burk
Regarding power-of-2 sized circular buffers, here is a handy way to 
verify that a bufferSize parameter is actually a power-of-2:


int init_circular_buffer( int bufferSize, ... )
{
  assert( (bufferSize & (bufferSize-1)) == 0 );
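
A small standalone variant of the same test, as a sketch (the helper name `is_power_of_two` is made up here). Note that the bare expression is also true for a size of 0, so this version rejects zero explicitly and takes an unsigned argument to avoid signed overflow at the extremes.

```c
#include <assert.h>
#include <stdbool.h>

/* True only for 1, 2, 4, 8, ...; a power of 2 has exactly one bit set,
   so clearing the lowest set bit with n & (n - 1) must leave zero. */
static bool is_power_of_two(unsigned int n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Example use inside an init routine (hypothetical signature). */
static unsigned int checked_index_mask(unsigned int bufferSize)
{
    assert(is_power_of_two(bufferSize));
    return bufferSize - 1;
}
```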

Phil Burk
--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-11 Thread Theo Verelst


>>On Mar 9, 2013, at 9:23 PM, robert bristow-johnson wrote:
>> if it's a wavetable and you're doing linear interpolation, a simple 
>>trick is to copy x[0] to x[256] (make a 256 point wavetable a 257 
>>element array) do the AND only for the first sample in the linear 
>>interpolation, the follow sample will always just follow and you need 
>>not AND the index for that second sample.


>I used that technique for decades on packet-based queues also, adding 
>max-1 to the buffer—for instance, MIDI manager had a maximum packet 
>size of 256, so extending the queue by 255 let me copy any packet 
>without checking for wrap in my ancient MIDI stuff.


Yeah, it's always a bit of a hassle to make sure that a communications 
processor, or a Unix or other processor computing a memory or paging 
segment index (issues since the 70s), or, as was a hobby of mine in 
the early 80s, a microprocessor cycling an audio (sample) fragment, 
does so gracefully and efficiently.


Mind that the whole discussion should weigh the efficiency against the 
elegance of the solution: the code for a "modulo + bit-and" can be 
short and easy to debug on the one hand, and it can execute efficiently 
on the other, depending on the infrastructure of the processor that 
runs it. However, memory architecture and memory access (in)efficiency 
are often more of a performance bottleneck than most other things, so 
unless you access a fast IO device the whole technique probably has 
only marginal relevance, and it probably doesn't prevent response time 
jitter (which a conditional statement can cause, as of course can 
cache data dependency and deviation in precise interrupt start timing 
because of the "current instruction" issues).


I'm glad the risk of mangling ants instead of finding universal truths 
in this computer design issue is limited these days, because of the 
return of micro-programming hardware, to the point of even defining 
operations completely with random hardware in an FPGA, which allows for 
instance a (Xilinx) Zynq processor to combine an efficient 
DMA+processing unit with an ARM core. Of course "ease of programming" 
or the option of ripping some Open Source code is also an issue, 
sometimes for the not honor-challenged.


T.Verelst
--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-10 Thread Nigel Redmon
On Mar 9, 2013, at 9:23 PM, robert bristow-johnson  
wrote:
> if it's a wavetable and you're doing linear interpolation, a simple trick is 
> to copy x[0] to x[256] (make a 256 point wavetable a 257 element array) do 
> the AND only for the first sample in the linear interpolation, the follow 
> sample will always just follow and you need not AND the index for that second 
> sample.

I used that technique for decades on packet-based queues also, adding max-1 to 
the buffer—for instance, MIDI manager had a maximum packet size of 256, so 
extending the queue by 255 let me copy any packet without checking for wrap in 
my ancient MIDI stuff.

Also a note that the modulo-by-AND indexing is built into some processors—the 
56K family, at least, as Robert knows well…buffers are the next power of two 
higher than the space needed, and the masking happens for free…

--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-10 Thread Richard Dobson

On 10/03/2013 05:23, robert bristow-johnson wrote:
..

ANDing with 2^p - 1 is a well-known and oft-used technique in C.
probably the best way to do it in C.


...


if it's a wavetable and you're doing linear interpolation, a simple
trick is to copy x[0] to x[256] (make a 256 point wavetable a 257
element array) do the AND only for the first sample in the linear
interpolation, the follow sample will always just follow and you need
not AND the index for that second sample.




There are useful examples of this in the Csound codebase, specifically 
the "oscili" family of opcodes, based on reading a function table where 
the extra element at the end is called a "guard point". For various 
reasons, Csound was extended a while back to allow non power-of-two 
table sizes, but the original opcodes are preserved. I included a 
discussion (with some standalone code) of the Csound oscillator in the 
Audio Programming Book, in the context of the description of the C 
bitwise operators. The "oscili" opcode is also interesting for the way 
it handles the fractional part of the interpolation using integer 
operations in the manner of a fixed-point computation; all of the above 
conspiring to make the Csound oscillator famously fast. There are 
reasons enough to have conditional tests inside a per-sample loop, but 
where they can be avoided, significant speedups can be achieved.
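
To illustrate the combination described above (power-of-two table, guard point, and a fixed-point phase whose fractional part is extracted with integer operations), here is a minimal sketch. It is not the Csound source; the names (`TableOsc`, `osc_tick`, ...), the 256-point table and the 8.24 phase split are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_BITS 8u
#define TABLE_SIZE (1u << TABLE_BITS)      /* 256 points */
#define FRAC_BITS  (32u - TABLE_BITS)      /* 24 fractional bits */
#define FRAC_MASK  ((1u << FRAC_BITS) - 1u)
#define FRAC_SCALE (1.0f / (float)(1u << FRAC_BITS))

typedef struct {
    float    table[TABLE_SIZE + 1];  /* last element is the guard point */
    uint32_t phase;                  /* 8.24 fixed point: index.fraction */
    uint32_t increment;              /* phase step per sample */
} TableOsc;

/* Call after filling table[0..TABLE_SIZE-1]. */
static void osc_set_guard_point(TableOsc *o)
{
    o->table[TABLE_SIZE] = o->table[0];
}

/* cyclesPerSample = frequency / sampleRate, expected in [0, 1). */
static void osc_set_frequency(TableOsc *o, double cyclesPerSample)
{
    assert(cyclesPerSample >= 0.0 && cyclesPerSample < 1.0);
    o->increment = (uint32_t)(cyclesPerSample * 4294967296.0);
}

static float osc_tick(TableOsc *o)
{
    uint32_t i    = o->phase >> FRAC_BITS;                 /* 0..255 by construction */
    float    frac = (float)(o->phase & FRAC_MASK) * FRAC_SCALE;
    /* table[i + 1] is always valid thanks to the guard point */
    float    y    = o->table[i] + frac * (o->table[i + 1] - o->table[i]);
    o->phase += o->increment;   /* 32-bit unsigned overflow wraps the phase */
    return y;
}
```

In this arrangement the per-sample loop contains no conditional and no explicit mask on the table index: the shift yields an in-range index and the unsigned overflow does the wrap-around.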


Richard Dobson

--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-10 Thread Tim Blechmann
>>> Though recent gcc versions will replace the above "a/3.14" with a
>>> multiplication, I remember a case where the denominator was constant
>>> as well but not quite as explicitly stated, where gcc 4.x produced a
>>> division instruction.
>>
>> not necessarily: in floating point math a/b and a * (1/b) do not yield
>> the same result. therefore the compiler should not optimize this, unless
>> explicitly asked to do so (-freciprocal-math)
> 
> I should have added, "when employing the usual suspects, -ffast-math
> -O6 etc, as you usually would when compiling DSP code".  Sorry!

-ffast-math should not be used without knowing what it actually does, as
it might break code in subtle ways. e.g. supercollider makes use of NaNs
to represent an asynchronous `demand rate'. this breaks, if the compiler
assumes that all floating point math is finite ...

tim
--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-09 Thread Ross Bencina


On 10/03/2013 3:51 PM, Alan Wolfe wrote:
> index = index + 1;
> if (index >= count)
>   index = 0;



> Another, more compact way could be to do it this way:
> index = (index + 1) % count;



I am suspicious about whether the mask is faster than the conditional for 
a couple of reasons:


- branch prediction works well if the branch usually falls one way

- cmove (conditional move instructions) can avoid an explicit branch

Once again, you would want to benchmark.



> There's a neat technique to do this faster that I have to admit i got
> from Ross's code a few years ago in his audio library PortAudio.


Probably I learnt that from Phil Burk. He is the master of bitmasks and 
circular buffers. There are also some other tricks in the PA code beyond 
what you mention here too...




> That technique requires that your circular buffer is a power of 2, but so
> long as that is true, you can do an AND to get the remainder of the
> division.  AND is super fast (even faster than the if / set) so it's a
> great improvement.
>
> How you do that looks like the below, assuming that your circular
> buffer is 1024 samples large:
> index = ((index + 1) & 1023);   // 1023 is just 1024-1
>
> if your buffer was 256 samples large it would look like this:
> index = ((index + 1) & 255); // 255 is just 256 - 1
>
> Super useful trick so wanted to share it with ya (:



My preferred technique is to avoid tests and masks in the inner loop by 
precomputing the loop length and hoisting the tests:


int samplesToProcess = ?

int i=0;
while( samplesToProcess > 0 ){
  // run up to the end of the buffer or the end of the request, whichever is first
  int samplesToEndOfBuffer = bufferSize - index;
  int n = min( samplesToEndOfBuffer, samplesToProcess );
  for( int j=0; j < n; ++j ){
    output[i++] = buffer[ index++ ];
  }

  if( index == bufferSize )
    index = 0; // wrap index

  samplesToProcess -= n;
}

this way the index is only ever tested outside the inner loop (no 
masking and no conditionals in the inner loop).
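
For anyone who wants to try this out, the same structure can be wrapped in a self-contained function. This is a sketch only; the function name `circular_read` and its signature are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Copies `count` samples out of a circular buffer, starting at *index and
   wrapping as needed. The wrap test lives in the outer loop only. */
static void circular_read(const float *buffer, size_t bufferSize,
                          size_t *index, float *output, size_t count)
{
    size_t written = 0;
    assert(bufferSize > 0 && *index < bufferSize);

    while (count > 0) {
        size_t toEnd = bufferSize - *index;
        size_t n = toEnd < count ? toEnd : count;

        /* inner loop: plain counted copy, no mask, no conditional */
        for (size_t j = 0; j < n; ++j)
            output[written + j] = buffer[*index + j];

        written += n;
        *index += n;
        if (*index == bufferSize)
            *index = 0;   /* wrap index */

        count -= n;
    }
}
```

Unlike the masking approach, this places no power-of-2 restriction on `bufferSize`.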


So long as the increment is significantly shorter than the buffer length 
you can make it work for non-integer increments too.


You can do this with more than one test (hoisting multiple unrelated 
conditionals from the inner loop).


Ross.



On Sat, Mar 9, 2013 at 12:14 PM, Tim Goetze  wrote:

[Tim Blechmann]

Though recent gcc versions will replace the above "a/3.14" with a
multiplication, I remember a case where the denominator was constant
as well but not quite as explicitly stated, where gcc 4.x produced a
division instruction.


not necessarily: in floating point math a/b and a * (1/b) do not yield
the same result. therefore the compiler should not optimize this, unless
explicitly asked to do so (-freciprocal-math)


I should have added, "when employing the usual suspects, -ffast-math
-O6 etc, as you usually would when compiling DSP code".  Sorry!

Tim

--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-09 Thread Ross Bencina

On 10/03/2013 7:01 AM, Tim Goetze wrote:

[robert bristow-johnson]

>On 3/9/13 1:31 PM, Wen Xue wrote:

>>I think one can trust the compiler to handle a/3.14 as a multiplication. If it
>>doesn't it'd probably be worse to write a*(1/3.14), for this would be a
>>division AND a multiplication.

>
>there are some awful crappy compilers out there.  even ones that start from gnu
>and somehow become a product sold for use with some DSP.

Though recent gcc versions will replace the above "a/3.14" with a
multiplication, I remember a case where the denominator was constant
as well but not quite as explicitly stated, where gcc 4.x produced a
division instruction.


I don't think this has anything to do with "crappy compilers"

Unless multiplication by reciprocal gives exactly the same result -- 
with the same precision and the same rounding behavior and the same 
denormal behavior etc. -- then it would be *incorrect* to automatically 
replace division by multiplication by reciprocal.


So I think it's more a case of conformant compilers, not crappy compilers.

I have always assumed that it is not (in general) valid for the compiler 
to automatically perform the replacement; and the only reason we can get 
away with it is because we make certain simplifying assumptions.
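
A quick way to see that the two forms really can differ is to count the cases where they disagree. The sketch below (the function name is made up) compares b / b with b * (1 / b) in double precision for small integer b; the `volatile` qualifiers are only there to discourage the compiler from folding everything at compile time, and the result assumes IEEE 754 doubles without fast-math style options.

```c
#include <assert.h>

/* Counts integers b in [1, limit] for which b / b and b * (1.0 / b)
   give different double results. */
static int count_reciprocal_mismatches(int limit)
{
    int mismatches = 0;
    assert(limit > 0);
    for (int i = 1; i <= limit; ++i) {
        volatile double b = (double)i;
        volatile double quotient = b / b;              /* exactly 1.0 */
        volatile double viaReciprocal = b * (1.0 / b); /* two roundings */
        if (quotient != viaReciprocal)
            ++mismatches;
    }
    return mismatches;
}
```

On a typical IEEE 754 machine `count_reciprocal_mismatches(100)` is nonzero (b = 49 is a well-known case), which is why a conformant compiler cannot make this substitution for an arbitrary divisor unless told it may.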


Ross.
--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-09 Thread robert bristow-johnson

On 3/9/13 11:51 PM, Alan Wolfe wrote:

There's a neat technique to do this faster that I have to admit i got
from Ross's code a few years ago in his audio library PortAudio.  That
technique requires that your circular buffer is a power of 2, but so
long as that is true, you can do an AND to get the remainder of the
division.  AND is super fast (even faster than the if / set) so it's a
great improvement.

How you do that looks like the below, assuming that your circular
buffer is 1024 samples large:
index = ((index + 1) & 1023);   // 1023 is just 1024-1

if your buffer was 256 samples large it would look like this:
index = ((index + 1) & 255); // 255 is just 256 - 1

Super useful trick so wanted to share it with ya (:


ANDing with 2^p - 1 is a well-known and oft-used technique in C.  
probably the best way to do it in C.


to have to bit-wise AND by 0x00FF (or whatever) for every time a sample 
is moved into or fetched from the delay line can sometimes be 
burdensome.  i have seen C code for a simple FIR filter that performs 
that AND only once per sample.  there are two loops (one after the 
other) to sum in the taps.  otherwise, i dunno how to ever avoid the 
bitwise AND.
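
A sketch of what such a two-loop FIR might look like (this is a reconstruction of the idea, not the code referred to above; the names and the 8-sample delay line are assumptions). The AND is applied once when the new sample is stored, and the tap sum is split at the point where the delay line wraps.

```c
#include <assert.h>
#include <stddef.h>

#define FIR_DELAY_SIZE 8u                    /* power of 2, >= number of taps */
#define FIR_DELAY_MASK (FIR_DELAY_SIZE - 1u)

typedef struct {
    float        delay[FIR_DELAY_SIZE];
    size_t       pos;      /* index of the newest sample */
    const float *taps;     /* taps[0] weights the newest sample */
    size_t       numTaps;
} Fir;

static void fir_init(Fir *f, const float *taps, size_t numTaps)
{
    assert(numTaps >= 1 && numTaps <= FIR_DELAY_SIZE);
    for (size_t i = 0; i < FIR_DELAY_SIZE; ++i) f->delay[i] = 0.0f;
    f->pos = 0;
    f->taps = taps;
    f->numTaps = numTaps;
}

static float fir_tick(Fir *f, float x)
{
    float  acc = 0.0f;
    size_t k = 0;
    size_t first;

    f->pos = (f->pos + 1) & FIR_DELAY_MASK;   /* the only AND per sample */
    f->delay[f->pos] = x;

    /* taps that can be summed walking down from pos to index 0 */
    first = f->pos + 1;
    if (first > f->numTaps) first = f->numTaps;

    for (; k < first; ++k)                    /* loop 1: before the wrap */
        acc += f->taps[k] * f->delay[f->pos - k];
    for (; k < f->numTaps; ++k)               /* loop 2: continue from the top */
        acc += f->taps[k] * f->delay[FIR_DELAY_SIZE + f->pos - k];

    return acc;
}
```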


if it's a wavetable and you're doing linear interpolation, a simple 
trick is to copy x[0] to x[256] (make a 256 point wavetable a 257 
element array) do the AND only for the first sample in the linear 
interpolation, the follow sample will always just follow and you need 
not AND the index for that second sample.


you can extend the idea more for higher order interpolation like with a 
precision delay.  let's say you have a 4096 sample delay buffer and your 
interpolation needs 16 samples for a good band-limited, sinc-like 
interpolation.  then whatever samples that go into x[0] through x[15], 
must also be copied over to x[4096] through x[4111].  but your initial 
index would always be masked by (4096-1).  it's just that when you tear 
through your interpolation, you need not worry about the indices of the 
16 samples you would be using in your convolution.  but worst case 
timing has it that you have to copy those 16 samples at the correct time.
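
A sketch of that mirrored layout, using the 4096-sample buffer and 16-sample kernel from the example above (the struct and function names are invented here). In this version the mirror copy is made at write time, one sample at a time, which is one way of copying "at the correct time".

```c
#include <assert.h>
#include <stddef.h>

#define DELAY_SIZE 4096u
#define DELAY_MASK (DELAY_SIZE - 1u)
#define KERNEL_LEN 16u

typedef struct {
    float  x[DELAY_SIZE + KERNEL_LEN];  /* x[4096..4111] mirrors x[0..15] */
    size_t writeIndex;
} MirroredDelay;

static void mirrored_write(MirroredDelay *d, float sample)
{
    size_t i = d->writeIndex;
    assert(i < DELAY_SIZE);
    d->x[i] = sample;
    if (i < KERNEL_LEN)
        d->x[i + DELAY_SIZE] = sample;   /* keep the mirror in sync */
    d->writeIndex = (i + 1) & DELAY_MASK;
}

/* Dot product of the kernel with KERNEL_LEN consecutive samples.
   Only the starting index is masked; the rest just follow. */
static float mirrored_convolve(const MirroredDelay *d, size_t start,
                               const float *kernel)
{
    const float *p = d->x + (start & DELAY_MASK);
    float acc = 0.0f;
    for (size_t k = 0; k < KERNEL_LEN; ++k)
        acc += kernel[k] * p[k];
    return acc;
}
```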


--

r b-j  r...@audioimagination.com

"Imagination is more important than knowledge."



--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-09 Thread Alan Wolfe
Hey while we are on the topic of efficiency and the OP not knowing
that division was slower...

Often times in DSP you'll use circular buffers (like for delay buffers
for instance).

Those are often implemented by having an array, and an index into the
array for where the next sample should go.  When you put a sample into
the buffer, you increment the index and then make sure that if the
index into the array is out of bounds, that it gets set back to zero
so that it continually goes through the array in a circular fashion
(thus the name circular buffer!).

Incrementing the index could be implemented like this:

index = index + 1;
if (index >= count)
  index = 0;

Another, more compact way could be to do it this way:
index = (index + 1) % count;

In that last one, it uses the modulo operator to get the remainder of
a division to make sure the index is within range.  The modulo
operator has to pay the full cost of the divide though to figure out
the remainder so it is the same cost as a division (talked about
earlier!).

There's a neat technique to do this faster that I have to admit i got
from Ross's code a few years ago in his audio library PortAudio.  That
technique requires that your circular buffer is a power of 2, but so
long as that is true, you can do an AND to get the remainder of the
division.  AND is super fast (even faster than the if / set) so it's a
great improvement.

How you do that looks like the below, assuming that your circular
buffer is 1024 samples large:
index = ((index + 1) & 1023);   // 1023 is just 1024-1

if your buffer was 256 samples large it would look like this:
index = ((index + 1) & 255); // 255 is just 256 - 1

Super useful trick so wanted to share it with ya (:
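
The three ways of stepping the index can be put side by side as small functions (names made up for this sketch) to check that they agree whenever the size is a power of 2. Which one is fastest depends on the compiler and CPU, so this shows equivalence only, not timing.

```c
#include <assert.h>

/* Compare and reset. Works for any count > 0. */
static unsigned next_index_if(unsigned index, unsigned count)
{
    index = index + 1;
    if (index >= count)
        index = 0;
    return index;
}

/* Modulo. Works for any count > 0, but may cost a division. */
static unsigned next_index_mod(unsigned index, unsigned count)
{
    return (index + 1) % count;
}

/* AND with count - 1. Only valid when count is a power of 2. */
static unsigned next_index_and(unsigned index, unsigned count)
{
    assert(count != 0 && (count & (count - 1)) == 0);
    return (index + 1) & (count - 1);
}
```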

On Sat, Mar 9, 2013 at 12:14 PM, Tim Goetze  wrote:
> [Tim Blechmann]
>>> Though recent gcc versions will replace the above "a/3.14" with a
>>> multiplication, I remember a case where the denominator was constant
>>> as well but not quite as explicitly stated, where gcc 4.x produced a
>>> division instruction.
>>
>>not necessarily: in floating point math a/b and a * (1/b) do not yield
>>the same result. therefore the compiler should not optimize this, unless
>>explicitly asked to do so (-freciprocal-math)
>
> I should have added, "when employing the usual suspects, -ffast-math
> -O6 etc, as you usually would when compiling DSP code".  Sorry!
>
> Tim
--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-09 Thread Tim Goetze
[Tim Blechmann]
>> Though recent gcc versions will replace the above "a/3.14" with a
>> multiplication, I remember a case where the denominator was constant
>> as well but not quite as explicitly stated, where gcc 4.x produced a
>> division instruction.
>
>not necessarily: in floating point math a/b and a * (1/b) do not yield
>the same result. therefore the compiler should not optimize this, unless
>explicitly asked to do so (-freciprocal-math)

I should have added, "when employing the usual suspects, -ffast-math
-O6 etc, as you usually would when compiling DSP code".  Sorry!

Tim
--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-09 Thread Tim Blechmann
>>> I think one can trust the compiler to handle a/3.14 as a multiplication. If 
>>> it
>>> >> doesn't it'd probably be worse to write a*(1/3.14), for this would be a
>>> >> division AND a multiplication.
>> >
>> > there are some awful crappy compilers out there.  even ones that start 
>> > from gnu
>> > and somehow become a product sold for use with some DSP.
> Though recent gcc versions will replace the above "a/3.14" with a
> multiplication, I remember a case where the denominator was constant
> as well but not quite as explicitly stated, where gcc 4.x produced a
> division instruction.

not necessarily: in floating point math a/b and a * (1/b) do not yield
the same result. therefore the compiler should not optimize this, unless
explicitly asked to do so (-freciprocal-math)

tim
--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-09 Thread Tim Goetze
[robert bristow-johnson]
> On 3/9/13 1:31 PM, Wen Xue wrote:
>> I think one can trust the compiler to handle a/3.14 as a multiplication. If 
>> it
>> doesn't it'd probably be worse to write a*(1/3.14), for this would be a
>> division AND a multiplication.
>
> there are some awful crappy compilers out there.  even ones that start from 
> gnu
> and somehow become a product sold for use with some DSP.

Though recent gcc versions will replace the above "a/3.14" with a
multiplication, I remember a case where the denominator was constant
as well but not quite as explicitly stated, where gcc 4.x produced a
division instruction.

(I think the denominator was a c++ template parameter subjected to a
binary shift operator but my memory of the exact circumstances is
hazy.  I do remember it was easy to evaluate at compile time, and my
surprise at the compiler not getting it.)

Tim
--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


[music-dsp] Efficiency of clear/copy/offset buffers

2013-03-09 Thread ChordWizard Software
> There is some discussion here:
> http://www.kvraudio.com/forum/viewtopic.php?t=348751

Hey Ross, this was a good thread.  So the picture I'm getting is this:  

1) a gain curve that is linear in dB will produce an ideal profile where equal 
incremental movements along the curve produce equal changes in perceived 
volume.

2) you can never get this curve to hit zero amplitude, which occurs at 
dB=-infinity, so you have to patch in a linear segment at the bottom to deal 
with this

3) a quick-and-dirty solution that approximates the linear dB gain curve (and 
handles the zero amplitude case automatically) is a simple x^2 curve (in range 
0.0 to 1.0)

Does it sound like I have the right end of the stick?


> When multiplying, you can do all the necessary multiplications in parallel 
> (think of performing a long multiply by hand: 1234 x 5678 for instance.
> It's easy to imagine how you could speed this up by having a few friends help 
> you

Nigel, thanks for this insight.  Makes perfect sense.

Regards,

Stephen Clarke
Managing Director
ChordWizard Software Pty Ltd
corpor...@chordwizard.com
http://www.chordwizard.com
ph: (+61) 2 4960 9520
fax: (+61) 2 4960 9580

--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-09 Thread James C Chandler Jr

On Mar 8, 2013, at 10:55 PM, Ross Bencina wrote:

> If your input is MIDI master volume you have to map from the MIDI value range 
> to linear gain (perhaps via decibels). Maybe there is a standard curve for 
> this?

There may be standards for subsets, such as GM. Or perhaps even a more global 
standard nowadays. Dunno. I have many older MIDI hardware synthesizers, from 
years when collecting them was almost a sickness. The older ones didn't appear 
to follow any standard and in fact even similar models from the same 
manufacturer couldn't be expected to follow an identical "company standard" 
volume curve.

Maybe there is more consistency nowadays? Tis been many years since buying a 
new hardware synth.
--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-09 Thread robert bristow-johnson

On 3/9/13 1:31 PM, Wen Xue wrote:
I think one can trust the compiler to handle a/3.14 as a 
multiplication. If it doesn't it'd probably be worse to write 
a*(1/3.14), for this would be a division AND a multiplication.


there are some awful crappy compilers out there.  even ones that start 
from gnu and somehow become a product sold for use with some DSP.


i think this guy named Michael Kahl should be the compiler czar of the 
world.  no compiler or development system is released anywhere in the 
world without his design and/or approval of it.


--

r b-j  r...@audioimagination.com

"Imagination is more important than knowledge."



--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-09 Thread Wen Xue
I think one can trust the compiler to handle a/3.14 as a multiplication. If 
it doesn't it'd probably be worse to write a*(1/3.14), for this would be a 
division AND a multiplication.



-Original Message- 
From: Nigel Redmon

Sent: Saturday, March 09, 2013 5:15 PM
To: A discussion list for music-related DSP
Subject: Re: [music-dsp] Efficiency of clear/copy/offset buffers

On Mar 8, 2013, at 2:53 PM, ChordWizard Software  
wrote:
But some are quite new - I never realised that multiplication ops were 
more efficient than divisions.


Worthy of some background...

When multiplying, you can do all the necessary multiplications in parallel 
(think of performing a long multiply by hand—1234 x 5678 for instance. It's 
easy to imagine how you could speed this up by having a few friends help 
you, where you manage the first digit, 4 x 5678, another handles 3(0) x 
5678, etc., at the same time.) but when you divide, you need to finish one 
digit before you know what the remainder is and you can move to the next 
digit. There's no way to look ahead—you need the result of the first step 
before doing the second. So, processors optimize multiplication and addition 
with parallel circuits, but division is iterated in a microcode loop (or 
done entirely in software). The 56K DSPs, for instance have a single-cycle 
multiply, but for division, "DIV" is a single division iteration—you need to 
do it for every digit you need to generate. It's just the nature of the 
operation.


Compilers may help you optimize constants, but it's always best to keep 
track of things yourself so you know what you're getting. So, yes, multiply 
by the sample period instead of dividing by the sample rate, etc.


--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-09 Thread Nigel Redmon
On Mar 8, 2013, at 2:53 PM, ChordWizard Software  
wrote:
> But some are quite new - I never realised that multiplication ops were more 
> efficient than divisions.

Worthy of some background...

When multiplying, you can do all the necessary multiplications in parallel 
(think of performing a long multiply by hand—1234 x 5678 for instance. It's 
easy to imagine how you could speed this up by having a few friends help you, 
where you manage the first digit, 4 x 5678, another handles 3(0) x 5678, etc., 
at the same time.) but when you divide, you need to finish one digit before you 
know what the remainder is and you can move to the next digit. There's no way 
to look ahead—you need the result of the first step before doing the second. 
So, processors optimize multiplication and addition with parallel circuits, but 
division is iterated in a microcode loop (or done entirely in software). The 
56K DSPs, for instance have a single-cycle multiply, but for division, "DIV" is 
a single division iteration—you need to do it for every digit you need to 
generate. It's just the nature of the operation.

Compilers may help you optimize constants, but it's always best to keep track 
of things yourself so you know what you're getting. So, yes, multiply by the 
sample period instead of dividing by the sample rate, etc.

--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-08 Thread Ross Bencina



On 9/03/2013 2:55 PM, Ross Bencina wrote:

Note that audio faders are not linear in decibels either, e.g.:
http://iub.edu/~emusic/etext/studio/studio_images/mixer9.jpg


There is some discussion here:

http://www.kvraudio.com/forum/viewtopic.php?t=348751


Ross.
--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-08 Thread Ross Bencina

On 9/03/2013 9:53 AM, ChordWizard Software wrote:

Maybe you can advise me on a related question - what's the best
approach to implementing attenuation?   I'm guessing it is not
linear, since perceived sound loudness has a logarithmic profile - or
am I confusing amplifier wattage with signal amplitude?


What I do is use a linear scaling value internally -- that's the number 
that multiplies the signal. Let's call it linearGain. linearGain has the 
value 1.0 for unity gain and 0.0 for infinite attenuation.


there is usually some mapping from "userGain"

linearGain = f( userGain );

If userGain  is expressed in decibels you can use the standard decibel 
to amplitude mapping:


linearGain = 10 ^ (gainDb / 20.)


If your input is MIDI master volume you have to map from the MIDI value 
range to linear gain (perhaps via decibels). Maybe there is a standard 
curve for this?



Note that audio faders are not linear in decibels either, e.g.:
http://iub.edu/~emusic/etext/studio/studio_images/mixer9.jpg

Ross.
--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-08 Thread robert bristow-johnson

On 3/8/13 5:53 PM, ChordWizard Software wrote:

Ross, Alan, Robert, thanks for the comments.

It's all good sense and very helpful as a reality check.  I had considered some 
of these concepts already, it's good to get these validated (and expanded).

But some are quite new - I never realised that multiplication ops were more 
efficient than divisions.


always multiply by 1/C rather than divide by C, if C is a constant. it 
means your "coefficient cooking" code has to compute the reciprocal, but 
that is not something that need be done at sample time but in the code 
that gets executed when a knob is twisted.
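
As a sketch of that split between "cooking" and sample-time code (the `Scaler` type and function names are invented for this example): the division happens once when the parameter changes, and the per-sample loop only multiplies.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float scale;   /* cooked coefficient: 1 / C */
} Scaler;

/* Control-rate code: runs when the knob is twisted. */
static void scaler_set_divisor(Scaler *s, float c)
{
    assert(c != 0.0f);
    s->scale = 1.0f / c;
}

/* Sample-rate code: no division in the loop. */
static void scaler_process(const Scaler *s, const float *in, float *out,
                           size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * s->scale;
}
```

The results can differ from a true division in the last bit for some values of C, which is usually acceptable for audio but is the reason a compiler will not do this by itself by default.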



I totally agree with the idea that macro efficiency is a more rewarding 
starting point than micro.  I generally do try to avoid copy routines that 
don't add any other value to the process at the same time.

But it's a tradeoff, isn't it, between efficiency and trying to keep the code 
modular


what i don't understand is why your modular code needs to make 
unnecessary copy operations. *every* instantiation of every module owns 
its own output buffers. and the inputs to every module are other 
modules' outputs (or the same module if you wanna do some delayed 
feedback). why and when do you need to copy? well, other than into a 
delay line buffer (like for FIR or multitap or reverb or similar). but 
that is an integral function of the module to begin with.


with the system I/O i can surely imagine the need to copy out of the 
system buffer to some nice de-interleaved signal buffers. and if your 
system is floating point, it makes sense to me to convert from fixed 
(what comes from the A/D buffer) to float and detangle the left and 
right channel samples. and if there is a global input gain knob, to 
apply that gain on the samples as they are being passed from one buffer 
into another. that's a piece of system code, not part of a module that 
may or may not be instantiated.



  enough that you don't end up with some arcane multi-op tangle that has to get 
duplicated and tweaked for every special case.

Anyway, if the general consensus is that memset and memcpy are reasonably 
efficient then that's my immediate need taken care of, as I’m trying very hard 
to stay cross-platform ready.

Maybe you can advise me on a related question - what's the best approach to 
implementing attenuation?   I'm guessing it is not linear, since perceived 
sound loudness has a logarithmic profile - or am I confusing amplifier wattage 
with signal amplitude?


i've never understood "attenuation" being anything other than a gain 
coefficient with magnitude less than 1. inside your DSP engine, 
"amplitude" is just a number (but we often like to have the rails 
defined at -1 and +1), and when that signal goes out into an amplifier 
and loud speaker, there can be talk of "wattage" in an absolute sense. 
but inside your alg, only relative wattage makes any sense. at least to 
me (maybe i'm missing something, like an obscure standard). multiply 
your signal by a gain coefficient equal to 1/2 (or -1/2) and your 
voltage level (and r.m.s. voltage) in the amp drops to half, your 
wattage drops to 1/4 of the previous level and it's a -6.02 dB change.
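
the arithmetic in that last sentence can be checked with a few lines of C++. the function names are hypothetical; the only assumptions are a non-zero coefficient and that power goes as amplitude squared.

```cpp
#include <cmath>

// Amplitude (voltage) ratio to decibels. The sign of the coefficient only
// flips polarity, so the magnitude is used. Assumes gain is non-zero.
inline double amplitudeRatioToDb(double gain)
{
    return 20.0 * std::log10(std::fabs(gain));
}

// Power is proportional to amplitude squared.
inline double powerRatio(double gain)
{
    return gain * gain;
}
```

with a coefficient of 1/2 or -1/2, amplitudeRatioToDb gives about -6.02 and powerRatio gives 0.25.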



The design of my audio engine is to drive a default GM softsynth,


are you coding the softsynth? or hooking up to someone else's?


  with optional overrides for each channel to use a VSTi or alternate 
synth/font instead.

Sysex Master Volume support is by no means assured for all of these possible 
outputs, particularly the VSTs, so I'm realising that I probably need to 
implement my own master volume control at the output.


well, your system output samples come from the output buffer that is 
owned by the module that is connected to the system output. at the end 
of your block processing time, after all of the modules got to process 
their input into their outputs, your system has a pointer to where the 
output blocks are and as you fetch those samples, you might have to 
interlace the samples from multiple channels, you might have to convert 
from float to fixed, and it appears to me that you might want to apply 
that Master Volume gain just before the float-to-fixed conversion.
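
a hedged C++ sketch of such an output stage, for two channels and 16-bit output. the function name, the stereo layout, the clamp, and the use of std::lrint are all assumptions made for illustration. std::lrint rounds using the current rounding mode, whereas a plain (short) cast truncates toward zero.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical output stage: interleave left and right, apply the master
// gain, and convert float (nominally -1..+1) to 16-bit samples in one pass.
inline void writeStereoOutput(const float* left, const float* right,
                              std::int16_t* interleavedOut, int frames,
                              float masterGain)
{
    // Fold the master gain and the float-to-fixed scaling into one factor.
    const float scale = masterGain * 32768.0f;
    for (int i = 0; i < frames; ++i)
    {
        float l = left[i] * scale;
        float r = right[i] * scale;
        // Clamp so that full-scale or over-range input cannot wrap around.
        if (l > 32767.0f) l = 32767.0f; else if (l < -32768.0f) l = -32768.0f;
        if (r > 32767.0f) r = 32767.0f; else if (r < -32768.0f) r = -32768.0f;
        interleavedOut[2 * i]     = static_cast<std::int16_t>(std::lrint(l));
        interleavedOut[2 * i + 1] = static_cast<std::int16_t>(std::lrint(r));
    }
}
```

the master gain costs nothing extra per sample here, because it is merged into the scale factor before the loop.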



The obvious approach of course is linear scaling, but something tells me there 
might be a better way to balance the increments of perceived volume difference 
across the whole range?


dunno what that is. a dB step issue?


--

r b-j  r...@audioimagination.com

"Imagination is more important than knowledge."





[music-dsp] Efficiency of clear/copy/offset buffers

2013-03-08 Thread ChordWizard Software
Ross, Alan, Robert, thanks for the comments.

It's all good sense and very helpful as a reality check.  I had considered some 
of these concepts already, it's good to get these validated (and expanded).  

But some are quite new - I never realised that multiplication ops were more 
efficient than divisions.

I totally agree with the idea that macro efficiency is a more rewarding 
starting point than micro.  I generally do try to avoid copy routines that 
don't add any other value to the process at the same time.

But it's a tradeoff, isn't it, between efficiency and trying to keep the code 
modular enough that you don't end up with some arcane multi-op tangle that has 
to get duplicated and tweaked for every special case.

Anyway, if the general consensus is that memset and memcpy are reasonably 
efficient then that's my immediate need taken care of, as I’m trying very hard 
to stay cross-platform ready.

Maybe you can advise me on a related question - what's the best approach to 
implementing attenuation?   I'm guessing it is not linear, since perceived 
sound loudness has a logarithmic profile - or am I confusing amplifier wattage 
with signal amplitude?

The design of my audio engine is to drive a default GM softsynth, with optional 
overrides for each channel to use a VSTi or alternate synth/font instead.

Sysex Master Volume support is by no means assured for all of these possible 
outputs, particularly the VSTs, so I'm realising that I probably need to 
implement my own master volume control at the output.

The obvious approach of course is linear scaling, but something tells me there 
might be a better way to balance the increments of perceived volume difference 
across the whole range?

Regards,

Stephen Clarke
Managing Director
ChordWizard Software Pty Ltd
corpor...@chordwizard.com
http://www.chordwizard.com
ph: (+61) 2 4960 9520
fax: (+61) 2 4960 9580


Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-07 Thread robert bristow-johnson

On 3/7/13 10:11 PM, Alan Wolfe wrote:

Quick 2 cents of my own to re-emphasize a point that Ross made -
profile to find out which is fastest if you aren't sure (although it's
good to ask too in case different systems have different oddities you
don't know about)

Also, if in the future you have performance issues, profile before
acting for maximum efficiency... often times what we suspect to be the
bottleneck of our application is in fact not the bottleneck at all.
Happens to everyone :P

lastly, copying buffers is an important thing to get right, but in
case you haven't heard this enough, when hitting performance problems
it's often better to do MACRO optimization instead of MICRO
optimization.

Macro optimization means changing your algorithm, being smarter with
the resources you have etc.

Micro optimization means turning multiplications into bitshifts,
breaking out the assembly and things like that.


one thing that makes sense to me, when i was worrying about this, was to 
try to do a few different tasks together in the same operation at a 
system level.  here's a case in point:


in some previous product that will go unnamed because i don't want 
anyone pissed at me for "revealing state secrets", the product had 
multichannel in and multichannel out.  the samples in the A/D and D/A 
DMA buffers were interlaced, fixed point, and scaled for the I/O 
device.  but we wanted the different channel buffers to not be 
interlaced for the internal algs and we wanted the data be converted to 
floating point (i don't like floating point so much, but the processor 
was float and the decision was made by bigger people than me that all 
the algs were to be floating point), and there were user-definable 
global gains going in and coming out of the box.


so i wrote (in assembly) a simple de-interlace, copy, scale, and 
convert-to-float of the samples going in, and the reverse of all of that 
for the samples going out.  doing all four operations together cost 
about the same as just copying the data when done in assembly.  maybe 
some setup overhead, but the sample was yanked from one buffer, 
converted to float, multiplied by the global gain, and stored into one 
of multiple other buffers.  and going out was the reverse.  in between, 
the sorta user-defined algs were mono or multichannel, but looked at 
each channel as just another mono signal in a block or buffer that 
didn't have any confusing interleaving (no "stride" needed, unless it 
was a crude down-sampler and that was part of the alg definition, but 
the algs never had to think about skipping over other channels' samples).
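
the original was assembly for an unnamed product, so the following C++ is only a sketch of the same idea for the input direction, with a hypothetical function name and an assumed two-channel, 16-bit interleaved layout.

```cpp
#include <cstdint>

// Hypothetical input stage: de-interleave, convert 16-bit fixed point to
// float, and apply a global input gain, all in a single pass.
inline void readStereoInput(const std::int16_t* interleavedIn,
                            float* left, float* right, int frames,
                            float inputGain)
{
    // Fold the fixed-to-float scaling and the gain into one multiplier.
    const float scale = inputGain * (1.0f / 32768.0f);
    for (int i = 0; i < frames; ++i)
    {
        left[i]  = static_cast<float>(interleavedIn[2 * i])     * scale;
        right[i] = static_cast<float>(interleavedIn[2 * i + 1]) * scale;
    }
}
```

each sample is read once and written once, which is the same memory traffic as a plain copy; the conversion and the multiply ride along with it.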


--

r b-j  r...@audioimagination.com

"Imagination is more important than knowledge."





Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-07 Thread Alan Wolfe
Quick 2 cents of my own to re-emphasize a point that Ross made -
profile to find out which is fastest if you aren't sure (although it's
good to ask too in case different systems have different oddities you
don't know about)

Also, if in the future you have performance issues, profile before
acting for maximum efficiency... often times what we suspect to be the
bottleneck of our application is in fact not the bottleneck at all.
Happens to everyone :P

lastly, copying buffers is an important thing to get right, but in
case you haven't heard this enough, when hitting performance problems
it's often better to do MACRO optimization instead of MICRO
optimization.

Macro optimization means changing your algorithm, being smarter with
the resources you have etc.

Micro optimization means turning multiplications into bitshifts,
breaking out the assembly and things like that.

Often times macro optimizations will get you a bigger win (don't
optimize a crappy sorting algorithm, just use a better sorting
algorithm and it'll be way better) and also will result in more
maintainable, portable code, so you should prefer going that route
first.

Hope this helps!

On Thu, Mar 7, 2013 at 2:48 PM, Ross Bencina  wrote:
> Stephen,
>
>
> On 8/03/2013 9:29 AM, ChordWizard Software wrote:
>>
>> a) additive mixing of audio buffers b) clearing to zero before
>> additive processing
>
>
> You could also consider writing (rather than adding) the first signal to the
> buffer. That way you don't have to zero it first. It requires having a
> "write" and an "add" version of your generators. Depending on your code this
> may or may not be worth the trouble vs zeroing first.
>
> In the past I've sometimes used C++ templates to parameterise by the output
> operation (write/add) so you only have to write the code that generates the
> signals once
>
>
> c) copying from one buffer to another
>
> Of course you should avoid this wherever possible. Consider using
> (reference counted) buffer objects so you can share them instead of copying
> data. You could use reference counting, or just reclaim everything at the
> end of every cycle.
>
>
>
> d) converting between short and float formats
>>
>>
>> No surprises to any of you there I'm sure.  My question is, can you
>> give me a few pointers about making them as efficient as possible
>> within that critical realtime loop?
>>
>> For example, how does the efficiency of memset, or ZeroMemory,
>> compare to a simple for loop?
>
>
> Usually memset has a special case for writing zeros, so you shouldn't see
> too much difference between memset and ZeroMemory.
>
> memset vs simple loop will depend on your compiler.
>
> The usual wisdom is:
>
> 1) use memset vs writing your own. the library implementation will use
> SSE/whatever and will be fast. Of course this depends on the runtime
>
> 2) always profile and compare if you care.
>
>
>
>> Or using HeapAlloc with the
>> HEAP_ZERO_MEMORY flag when the buffer is created (I know buffers
>> shouldn’t be allocated in a realtime callback, but just out of
>> interest, I assume an initial zeroing must come at a cost compared to
>> not using that flag)?
>
>
> It could happen in a few ways, but I'm not sure how it *does* happen on
> Windows and OS X.
>
> For example the MMU could map all the pages to a single zero page and then
> allocate+zero only when there is a write to the page.
>
>
>
>> I'm using Win32 but intend to port to OSX as well, so comments on the
>> merits of cross-platform options like the C RTL would be particularly
>> helpful.  I realise some of those I mention above are Win-specific.
>>
>> Also for converting sample formats, are there more efficient options
>> than simply using
>>
>> nFloat = (float)nShort / 32768.0
>
>
> Unless you have a good reason not to, you should prefer multiplication by
> the reciprocal for the first one:
>
> const float scale = (float)(1. / 32768.0);
> nFloat = (float)nShort * scale;
>
> You can do 4 at once if you use SSE/intrinsics.
>
>
>> nShort = (short)(nFloat * 32768.0)
>
> Float => int conversion can be expensive depending on your compiler settings
> and supported processor architectures. There are various ways around this.
>
> Take a look at pa_converters.c and the pa_x86_plain_converters.c in
> PortAudio. But you can do better with SSE.
>
>
>
>> for every sample?
>>
>> Are there any articles on this type of optimisation that can give me
>> some insight into what is happening behind the various memory
>> management calls?
>
>
> Probably. I would make sure you allocate aligned memory, maybe lock it in
> physical memory, and then use it -- and generally avoid OS-level memory
> calls from then on.
>
> I would use memset() and memcpy(). These are optimised and the compiler may even
> inline an even more optimal version.
>
> The alternative is to go low-level and benchmark everything and write your
> own code in SSE (and learn how to optimise it).
>
> If you really care you need a good profiler.
>
> That's my 2c.
>
> HTH
>
> Ross.
>

Re: [music-dsp] Efficiency of clear/copy/offset buffers

2013-03-07 Thread Ross Bencina

Stephen,

On 8/03/2013 9:29 AM, ChordWizard Software wrote:

a) additive mixing of audio buffers b) clearing to zero before
additive processing


You could also consider writing (rather than adding) the first signal to 
the buffer. That way you don't have to zero it first. It requires having 
a "write" and an "add" version of your generators. Depending on your 
code this may or may not be worth the trouble vs zeroing first.


In the past I've sometimes used C++ templates to parameterise by the 
output operation (write/add) so you only have to write the code that 
generates the signals once.
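
One way this could look, as a sketch rather than the code actually used: WriteOp, AddOp and generateRamp are hypothetical names, and the ramp is only a stand-in for a real generator.

```cpp
// Hypothetical output policies: each one says how a generated sample is
// combined with what is already in the destination buffer.
struct WriteOp { static void apply(float& dest, float value) { dest = value; } };
struct AddOp   { static void apply(float& dest, float value) { dest += value; } };

// A trivial stand-in generator (a linear ramp). The signal code is written
// once and the output operation is chosen at compile time.
template <typename OutputOp>
void generateRamp(float* out, int count, float start, float step)
{
    float value = start;
    for (int i = 0; i < count; ++i)
    {
        OutputOp::apply(out[i], value);
        value += step;
    }
}
```

The first generator writing into a mix buffer would call generateRamp<WriteOp>, so the buffer never needs zeroing, and later ones would call generateRamp<AddOp>.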


c) copying from one buffer to another

Of course you should avoid this wherever possible. Consider using 
(reference counted) buffer objects so you can share them instead of 
copying data. You could use reference counting, or just reclaim 
everything at the end of every cycle.
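
A minimal sketch of the sharing idea, using std::shared_ptr as the reference count. The type names are hypothetical, and a real engine may well prefer its own pool and intrusive counts.

```cpp
#include <memory>
#include <vector>

// Hypothetical types: a buffer of samples and a shared, read-only handle.
using AudioBuffer = std::vector<float>;
using BufferRef = std::shared_ptr<const AudioBuffer>;

// A module input just holds a reference to another module's output.
struct ModuleInput
{
    BufferRef source;

    // Connecting copies a pointer and bumps the count; no samples move.
    void connect(const BufferRef& upstream) { source = upstream; }
};
```

Allocating the buffers (for example with std::make_shared) belongs outside the realtime callback, and the count updates in std::shared_ptr are atomic, so this is best done when the graph changes rather than per block.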



d) converting between short and float formats


No surprises to any of you there I'm sure.  My question is, can you
give me a few pointers about making them as efficient as possible
within that critical realtime loop?

For example, how does the efficiency of memset, or ZeroMemory,
compare to a simple for loop?


Usually memset has a special case for writing zeros, so you shouldn't 
see too much difference between memset and ZeroMemory.


memset vs simple loop will depend on your compiler.

The usual wisdom is:

1) use memset rather than writing your own. The library implementation will 
use SSE/whatever and will be fast. Of course this depends on the runtime.


2) always profile and compare if you care.
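
For point 2, a crude comparison harness might look like the following. The function names are hypothetical and the timing method is deliberately simple; a real profiler will tell you more.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>

// Clearing with the C runtime. All-bits-zero is 0.0f for IEEE 754 floats,
// which is what the usual desktop targets use.
inline void clearWithMemset(float* buffer, std::size_t count)
{
    std::memset(buffer, 0, count * sizeof(float));
}

// The hand-written alternative. An optimising compiler may turn this into
// a memset call or vector stores anyway.
inline void clearWithLoop(float* buffer, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        buffer[i] = 0.0f;
}

// Hypothetical timing helper: average nanoseconds per call of clearFn.
template <typename ClearFn>
double averageNanoseconds(ClearFn clearFn, float* buffer, std::size_t count,
                          int repeats)
{
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        clearFn(buffer, count);
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / repeats;
}
```

Treat the numbers with suspicion: the optimiser may drop repeated clears whose results are never read, and cache state matters, so measure with realistic buffer sizes inside the real application where possible.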



Or using HeapAlloc with the
HEAP_ZERO_MEMORY flag when the buffer is created (I know buffers
shouldn’t be allocated in a realtime callback, but just out of
interest, I assume an initial zeroing must come at a cost compared to
not using that flag)?


It could happen in a few ways, but I'm not sure how it *does* happen on 
Windows and OS X.


For example the MMU could map all the pages to a single zero page and 
then allocate+zero only when there is a write to the page.




I'm using Win32 but intend to port to OSX as well, so comments on the
merits of cross-platform options like the C RTL would be particularly
helpful.  I realise some of those I mention above are Win-specific.

Also for converting sample formats, are there more efficient options
than simply using

nFloat = (float)nShort / 32768.0


Unless you have a good reason not to, you should prefer multiplication by 
the reciprocal for the first one:


const float scale = (float)(1. / 32768.0);
nFloat = (float)nShort * scale;

You can do 4 at once if you use SSE/intrinsics.
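
As a rough illustration of the four-at-once version, here is a sketch using SSE2 intrinsics with a scalar fallback. The function name and the HAVE_SSE2_SKETCH macro are made up for this example, and it has not been tuned or benchmarked.

```cpp
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_SKETCH 1
#endif

// Hypothetical converter: 16-bit samples to floats in the range -1..+1.
inline void shortsToFloats(const std::int16_t* in, float* out, int count)
{
    const float scale = 1.0f / 32768.0f;
    int i = 0;
#if defined(HAVE_SSE2_SKETCH)
    const __m128 vscale = _mm_set1_ps(scale);
    for (; i + 4 <= count; i += 4)
    {
        // Load four 16-bit samples into the low half of a register.
        __m128i s16 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(in + i));
        // Sign-extend to 32 bits: duplicate each sample into the high half
        // of a 32-bit lane, then arithmetic-shift it back down.
        __m128i s32 = _mm_srai_epi32(_mm_unpacklo_epi16(s16, s16), 16);
        // Convert to float, scale, and store (unaligned store for safety).
        _mm_storeu_ps(out + i, _mm_mul_ps(_mm_cvtepi32_ps(s32), vscale));
    }
#endif
    for (; i < count; ++i)   // scalar tail, or everything without SSE2
        out[i] = static_cast<float>(in[i]) * scale;
}
```

With aligned buffers the unaligned store could become _mm_store_ps, and a compiler with auto-vectorisation enabled may already produce something similar from the scalar loop, so profile before keeping hand-written intrinsics.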

nShort = (short)(nFloat * 32768.0)

Float => int conversion can be expensive depending on your compiler 
settings and supported processor architectures. There are various ways 
around this.


Take a look at pa_converters.c and the pa_x86_plain_converters.c in 
PortAudio. But you can do better with SSE.




for every sample?

Are there any articles on this type of optimisation that can give me
some insight into what is happening behind the various memory
management calls?


Probably. I would make sure you allocate aligned memory, maybe lock it 
in physical memory, and then use it -- and generally avoid OS-level 
memory calls from then on.


I would use memset() and memcpy(). These are optimised and the compiler may 
even inline an even more optimal version.


The alternative is to go low-level and benchmark everything and write 
your own code in SSE (and learn how to optimise it).


If you really care you need a good profiler.

That's my 2c.

HTH

Ross.






Regards,

Stephen Clarke Managing Director ChordWizard Software Pty Ltd
corpor...@chordwizard.com http://www.chordwizard.com ph: (+61) 2 4960
9520 fax: (+61) 2 4960 9580







[music-dsp] Efficiency of clear/copy/offset buffers

2013-03-07 Thread ChordWizard Software
Greetings, and apologies in advance for bringing up what must be a well-covered 
topic on this list, I just couldn't find it in the archives anywhere.

I'm in the final stages of building an audio host/synth engine in C++, and of 
course a large part of its realtime workload is building and transferring audio 
buffers:

a) additive mixing of audio buffers
b) clearing to zero before additive processing
c) copying from one buffer to another
d) converting between short and float formats

No surprises to any of you there I'm sure.  My question is, can you give me a 
few pointers about making them as efficient as possible within that critical 
realtime loop?

For example, how does the efficiency of memset, or ZeroMemory, compare to a 
simple for loop?  Or using HeapAlloc with the HEAP_ZERO_MEMORY flag when the 
buffer is created (I know buffers shouldn’t be allocated in a realtime 
callback, but just out of interest, I assume an initial zeroing must come at a 
cost compared to not using that flag)?

I'm using Win32 but intend to port to OSX as well, so comments on the merits of 
cross-platform options like the C RTL would be particularly helpful.  I realise 
some of those I mention above are Win-specific.

Also for converting sample formats, are there more efficient options than 
simply using

nFloat = (float)nShort / 32768.0
nShort = (short)(nFloat * 32768.0)

for every sample?

Are there any articles on this type of optimisation that can give me some 
insight into what is happening behind the various memory management calls?

Regards,

Stephen Clarke
Managing Director
ChordWizard Software Pty Ltd
corpor...@chordwizard.com
http://www.chordwizard.com
ph: (+61) 2 4960 9520
fax: (+61) 2 4960 9580
