Re: [music-dsp] LLVM or GCC for DSP Architectures

2014-12-13 Thread Stefan Stenzel
I agree with you Paul, the Cortex-M4 is an excellent choice for Audio DSP.

However, besides the DSP extensions for dual operations on both halves of
32-bit numbers, there are also DSP instructions for 32 bit processing that I
would recommend over the dual 16 bit ones. Proper use of these might require
some moderate inline assembly. I did that in a recent project and it was
blissfully relaxing to code in C++ with just some inner loops optimized in
assembly. The expense of using 32 bit instead of 16 bit is way lower than a
factor of two for practical purposes, more in the range of a factor of maybe
1.2 - but with much better audio quality.
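
For example, one of those 32 bit instructions is SMMLAR: a 32x32 multiply
whose rounded top 32 bits are added to an accumulator. A minimal sketch of
wrapping it for GCC, assuming an ARMv7E-M target such as the Cortex-M4 (the
helper name is made up):

#include <stdint.h>

/* Sketch: wrap the 32 bit DSP instruction SMMLAR in an inline function.
   Computes acc + (((int64_t)a * b + 0x80000000) >> 32) in one instruction. */
static inline int32_t mac32_hi_rounded(int32_t acc, int32_t a, int32_t b)
{
    int32_t result;
    __asm__ ("smmlar %0, %1, %2, %3"
             : "=r" (result)
             : "r" (a), "r" (b), "r" (acc));
    return result;
}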

Stefan


[music-dsp] LLVM or GCC for DSP Architectures

2014-12-08 Thread Stefan Sullivan
Hey music DSP folks,

I'm wondering if anybody knows much about using these open source compilers
to compile to various DSP architectures (e.g. SHARC, ARM, TI, etc). To be
honest I don't know so much about the compilers/toolchains for these
architectures (they are mostly proprietary compilers, right?). I'm just
wondering if anybody has hooked up a back-end for these architectures to a
more widely used compiler.

The reason I ask is because I've done quite a bit of development lately
with C++ and template programming. I'm always struggling with being able to
develop more advanced, widely applicable audio code, and being able to
address lower-level DSP architectures. I am assuming that the more advanced
feature set of C++11 (and eventually C++14) would be slower to appear in
these proprietary compilers.

Thanks all,
Stefan
--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp


Re: [music-dsp] LLVM or GCC for DSP Architectures

2014-12-08 Thread Paul Stoffregen

On 12/08/2014 01:35 PM, Stefan Sullivan wrote:

Hey music DSP folks,

I'm wondering if anybody knows much about using these open source compilers
to compile to various DSP architectures (e.g. SHARC, ARM, TI, etc).


I have some experience with ARM Cortex-M4, using fixed point. Everything 
in this message is specific only to Cortex-M4, and might apply to the 
upcoming Cortex-M7, but it's unrelated to other processors.


ARM's marketing material makes a lot of carefully worded claims about 
DSP extensions which provide up to a 2X speed improvement.  That is 
true.  But many people mistakenly leap to the conclusion that there's a 
DSP co-processor or something resembling a DSP architecture in the 
chip.  In traditional DSP, you'd expect simultaneous fetching of data 
and coefficient, multiply-accumulate and loop counting.  Cortex-M4 has 
nothing like that.  It's still very much a traditional microprocessor 
where those operations are separate instructions.


The DSP extensions are designed for 16 bit fixed point data, which you 
pack into the native 32 bit memory and registers.  Much of the 2X 
speedup occurs in the loading and storing of data.  With ARM's 
traditional instructions, 16 bit data is sign extended into 32 bit 
registers on load, and those upper 16 bits are discarded when storing it 
back to memory.  Cortex-M4 also has an optimization in hardware where it 
detects successive instructions performing similar load or store and 
combines them into a single burst access on the bus, so the 2nd, 3rd, and
4th accesses take only a single cycle each.  In traditional DSP, the
architecture loads data in the same cycle as the math is performed.  At 
best, ARM's extension gets the loading overhead close to 0.5 cycles per 
16 bit word.
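
To make the packing concrete, a plain-C sketch (hypothetical helper;
little-endian Cortex-M assumed): reading two adjacent 16 bit samples through
one 32 bit load keeps them packed in a single register, which is the form
the dual instructions expect:

#include <stdint.h>
#include <string.h>

/* Sketch: load two adjacent 16 bit samples as one packed 32 bit word.
   On a little-endian Cortex-M4 the first sample ends up in the low half
   and the second in the high half; at -O2 the memcpy typically becomes
   a single LDR. */
static inline uint32_t load_packed_pair(const int16_t *src)
{
    uint32_t packed;
    memcpy(&packed, src, sizeof packed);
    return packed;
}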


Actually using the DSP extensions requires keen awareness and planning 
of the ARM register usage.  As far as I know, the only way to cause gcc 
to use them is inline assembly, which is usually wrapped with inline 
functions or preprocessor macros.  ARM's marketing material makes a lot 
of claims about how only C programming is needed.  While that's 
technically true, given an already-written header file with the inline 
assembly (some commercial compilers have intrinsics which are 
basically the same thing), the honest truth is assembly code is 
involved.  Really leveraging these instructions requires careful 
planning of how many registers you'll use to bring in packed pairs of 
samples, how many will hold your intermediate calculations, loop 
counters, pointers, and other overhead.  If you exceed the 12 or 13 
available ARM 32 bit registers, the compiler needs to spill variables 
onto the stack, which ruins any speed benefit you might hope to achieve 
by going to so much effort to use the DSP extensions.
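
The wrapping usually looks something like the following sketch, assuming GCC
and an ARMv7E-M target. SMLAD does two 16x16 multiplies on the packed halves
and adds both products to a 32 bit accumulator (the helper name is made up):

#include <stdint.h>

/* Sketch: expose the dual multiply-accumulate SMLAD to C code.
   Computes acc + low16(x)*low16(y) + high16(x)*high16(y). */
static inline int32_t smlad_wrap(uint32_t x, uint32_t y, int32_t acc)
{
    int32_t result;
    __asm__ ("smlad %0, %1, %2, %3"
             : "=r" (result)
             : "r" (x), "r" (y), "r" (acc));
    return result;
}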


Another feature of DSP fixed point architectures is automatic saturation 
(clipping) during addition.  This too is usually done with a separate 
instruction on ARM.  They do provide a couple add instructions with 
automatic saturation, but pervasive support for saturation during all 
calculations is not present.
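
For example, QADD is one of those saturating adds; a minimal sketch of a
wrapper, under the same GCC/ARMv7E-M assumptions:

#include <stdint.h>

/* Sketch: 32 bit saturating add (QADD) - the sum is clipped to
   INT32_MIN..INT32_MAX instead of wrapping around. */
static inline int32_t qadd32(int32_t a, int32_t b)
{
    int32_t result;
    __asm__ ("qadd %0, %1, %2"
             : "=r" (result)
             : "r" (a), "r" (b));
    return result;
}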


Looping overhead is also still an issue.  Typically, you would compose 
your code to process 4, 8, or 16 samples in each loop iteration.  That 
lets you use the pipeline burst to bring the packed samples into 2, 3
or 4 registers.  Then you'd unroll your code, placing 4, 8 or 16 copies 
of whatever math you're doing, and store the results to the output 
buffer, taking advantage of the pipeline burst for writing.  Then you'd 
suffer looping overhead, which isn't so bad if you're processing 8 or 16 
samples per iteration.
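
The shape of such a loop, with the DSP instructions left out, is roughly the
following plain-C sketch (the gain example and the assumption that the block
length is a multiple of 4 are made up):

#include <stdint.h>
#include <string.h>

/* Sketch: process 4 samples per iteration with packed 32 bit loads and
   stores, so the bus sees burst-friendly accesses.  The per-sample math
   here is plain C; in real code it would be the packed DSP instructions. */
static void apply_gain_q15(int16_t *buf, int n, int16_t gain)
{
    for (int i = 0; i < n; i += 4) {           /* n assumed multiple of 4 */
        uint32_t p0, p1;
        memcpy(&p0, &buf[i],     4);           /* samples i, i+1   */
        memcpy(&p1, &buf[i + 2], 4);           /* samples i+2, i+3 */

        int16_t s0 = (int16_t)p0, s1 = (int16_t)(p0 >> 16);
        int16_t s2 = (int16_t)p1, s3 = (int16_t)(p1 >> 16);

        s0 = (int16_t)(((int32_t)s0 * gain) >> 15);   /* math unrolled 4x */
        s1 = (int16_t)(((int32_t)s1 * gain) >> 15);
        s2 = (int16_t)(((int32_t)s2 * gain) >> 15);
        s3 = (int16_t)(((int32_t)s3 * gain) >> 15);

        p0 = (uint16_t)s0 | ((uint32_t)(uint16_t)s1 << 16);
        p1 = (uint16_t)s2 | ((uint32_t)(uint16_t)s3 << 16);
        memcpy(&buf[i],     &p0, 4);
        memcpy(&buf[i + 2], &p1, 4);
    }
}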


I've written a lot about code structure, planning of data packing, and
register allocation, so far without any specifics of the actual
operations, for a good reason.  That's what really using the DSP extensions
is like.  You spend almost all the time (or at least I do) planning
this stuff, so you can actually take advantage of the narrow but useful
features those instructions provide.


The actual instructions are documented in the ARM v7-M reference manual 
(ARM document DDI0403D), starting on page 133, section A4.4.3.


Probably the most interesting instructions are SMLALD and SMLALDX.  Each
performs two 16x16 signed multiplies and adds both products to a signed 64
bit accumulator.   The 4 numbers to multiply have to be packed into 2
normal 32 bit ARM registers.  SMLALD multiplies the lower halves together
and the upper halves together, and SMLALDX multiplies the lower half in
one register with the upper half in the other, and vice versa.  No other
combinations are possible, so you must arrange your data appropriately
if you want to get 2 multiply-accumulates in a single cycle.  But there
is a version that subtracts one of the products.  There are also versions
that accumulate to only 32 bits, which give you one extra precious 32
bit register, in cases where you're sure overflow isn't an issue
(remember, these don't automatically saturate if your
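
A sketch of wrapping SMLALD for GCC, under the same ARMv7E-M assumptions
(the %Q/%R operand modifiers pick the low and high registers of the 64 bit
accumulator; the helper name is made up):

#include <stdint.h>

/* Sketch: SMLALD - two 16x16 multiplies on the packed halves of x and y,
   both products added into a 64 bit accumulator.  SMLALDX would swap the
   halves of y ("smlaldx" in place of "smlald"). */
static inline int64_t smlald_wrap(uint32_t x, uint32_t y, int64_t acc)
{
    __asm__ ("smlald %Q0, %R0, %1, %2"
             : "+r" (acc)
             : "r" (x), "r" (y));
    return acc;
}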

Re: [music-dsp] LLVM or GCC for DSP Architectures

2014-12-08 Thread Sham Beam
Thanks for the info Paul. I've been considering using Teensy for a 
DIY project for a while now. Just haven't found the time to start yet.



Shannon

