Re: [music-dsp] LLVM or GCC for DSP Architectures
I agree with you, Paul, the Cortex-M4 is an excellent choice for audio DSP. However, besides the DSP extension's dual operations on both halves of 32-bit numbers, there are also DSP instructions for 32-bit processing that I would recommend over the dual 16-bit ones. Proper use of these might require some moderate inline assembly. I did that in a recent project, and it was blissfully relaxing to code in C++ with just some inner loops optimized in assembly. For practical purposes, the cost of using 32-bit instead of 16-bit is well below a factor of two, more in the range of a factor of maybe 1.2, but with much better audio quality.

Stefan

On 09 Dec 2014, at 1:28, Paul Stoffregen p...@pjrc.com wrote: [...]
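Stefan doesn't name the 32-bit DSP instructions he recommends; one plausible candidate on Cortex-M4 is SMMUL (signed most-significant-word multiply), which returns the top 32 bits of a 64-bit signed product in a single instruction. A portable C model of that operation, as a sketch (the instruction choice and the Q31 framing are my illustration, not from the post):

```c
#include <stdint.h>

/* Portable model of a 32-bit DSP multiply such as Cortex-M4's SMMUL:
   the top 32 bits of the full 64-bit signed product. For Q31
   fixed-point inputs, the result lands in Q30 format. */
static inline int32_t smmul_model(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}
```

For example, 0.5 * 0.5 in Q31 (0x40000000 * 0x40000000) yields 0x10000000, i.e. 0.25 in Q30. On the real chip this costs one cycle instead of the multi-instruction sequence plain C would otherwise compile to.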
[music-dsp] LLVM or GCC for DSP Architectures
Hey music DSP folks,

I'm wondering if anybody knows much about using these open source compilers to compile for various DSP architectures (e.g. SHARC, ARM, TI, etc). To be honest I don't know much about the compilers/toolchains for these architectures (they are mostly proprietary compilers, right?). I'm just wondering if anybody has hooked a back-end for these architectures into a more widely used compiler. The reason I ask is that I've done quite a bit of development lately with C++ and template programming. I'm always struggling between developing more advanced, widely applicable audio code and being able to target lower-level DSP architectures. I am assuming that the more advanced feature set of C++11 (and eventually C++14) would be slower to appear in these proprietary compilers.

Thanks all,
Stefan

--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp
Re: [music-dsp] LLVM or GCC for DSP Architectures
On 12/08/2014 01:35 PM, Stefan Sullivan wrote: Hey music DSP folks, I'm wondering if anybody knows much about using these open source compilers to compile to various DSP architectures (e.g. SHARC, ARM, TI, etc).

I have some experience with ARM Cortex-M4, using fixed point. Everything in this message is specific only to Cortex-M4, and might apply to the upcoming Cortex-M7, but it's unrelated to other processors.

ARM's marketing material makes a lot of carefully worded claims about DSP extensions which provide up to a 2X speed improvement. That is true. But many people mistakenly leap to the conclusion that there's a DSP co-processor or something resembling a DSP architecture in the chip. In a traditional DSP, you'd expect simultaneous fetching of data and coefficient, multiply-accumulate, and loop counting. Cortex-M4 has nothing like that. It's still very much a traditional microprocessor where those operations are separate instructions.

The DSP extensions are designed for 16-bit fixed point data, which you pack into the native 32-bit memory and registers. Much of the 2X speedup occurs in the loading and storing of data. With ARM's traditional instructions, 16-bit data is sign extended into a 32-bit register on load, and the upper 16 bits are discarded when storing it back to memory. Cortex-M4 also has a hardware optimization where it detects successive instructions performing similar loads or stores and combines them into a single burst access on the bus, so the 2nd, 3rd, and 4th accesses take only a single cycle. In a traditional DSP, the architecture loads data in the same cycle as the math is performed. At best, ARM's extension gets the loading overhead close to 0.5 cycles per 16-bit word.

Actually using the DSP extensions requires keen awareness and planning of ARM register usage. As far as I know, the only way to get gcc to use them is inline assembly, which is usually wrapped in inline functions or preprocessor macros.
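The packing idea described above can be sketched in portable C: instead of loading samples one int16_t at a time, you load a uint32_t holding two adjacent samples, operate on both halves, and store two per 32-bit write. The function name and the Q15 gain example are my illustration; on the real chip, the per-half math would be done by the SIMD-style DSP instructions rather than plain C:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: process two 16-bit samples per 32-bit load/store.
   On little-endian targets (including Cortex-M4's usual config),
   the low half is buf[i] and the high half is buf[i+1]; since both
   halves get the same treatment here, the result is the same either
   way. gain_q15 is a Q15 fixed-point gain (0x4000 == 0.5). */
void apply_gain_packed(int16_t *buf, size_t n_samples, int16_t gain_q15)
{
    for (size_t i = 0; i + 1 < n_samples; i += 2) {
        uint32_t packed;
        memcpy(&packed, &buf[i], sizeof packed);     /* one 32-bit load  */

        int16_t lo = (int16_t)(packed & 0xFFFFu);
        int16_t hi = (int16_t)(packed >> 16);

        lo = (int16_t)(((int32_t)lo * gain_q15) >> 15);
        hi = (int16_t)(((int32_t)hi * gain_q15) >> 15);

        packed = ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
        memcpy(&buf[i], &packed, sizeof packed);     /* one 32-bit store */
    }
}
```

This is where the "close to 0.5 cycles per 16-bit word" figure comes from: each load and store moves two samples, and successive accesses combine into a burst.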
ARM's marketing material makes a lot of claims about how only C programming is needed. While that's technically true, given an already-written header file with the inline assembly (some commercial compilers have intrinsics, which are basically the same thing), the honest truth is assembly code is involved. Really leveraging these instructions requires careful planning of how many registers you'll use to bring in packed pairs of samples, and how many will hold your intermediate calculations, loop counters, pointers, and other overhead. If you exceed the 12 or 13 available ARM 32-bit registers, the compiler needs to spill variables onto the stack, which ruins any speed benefit you might hope to achieve by going to so much effort to use the DSP extensions.

Another feature of fixed point DSP architectures is automatic saturation (clipping) during addition. This too usually requires a separate instruction on ARM. They do provide a couple of add instructions with automatic saturation, but pervasive support for saturation during all calculations is not present.

Looping overhead is also still an issue. Typically, you would compose your code to process 4, 8, or 16 samples in each loop iteration. That lets you use the pipeline burst to bring the packed samples into 2, 3, or 4 registers. Then you'd unroll your code, placing 4, 8, or 16 copies of whatever math you're doing, and store the results to the output buffer, taking advantage of the pipeline burst for writing. Then you'd suffer the looping overhead, which isn't so bad if you're processing 8 or 16 samples per iteration.

I've written a lot about code structure, planning of data packing, and register allocation, so far without any specifics of the actual operations, for a good reason. Really using the DSP extensions is like this: you spend almost all the time (or at least I do) planning this stuff, so you can actually take advantage of the narrow but useful features those instructions provide.
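The saturating add mentioned above (QADD16 on Cortex-M4 does this for both halfwords at once) clips the sum to the representable range instead of wrapping around. A plain-C model of the per-half behavior, as a sketch:

```c
#include <stdint.h>

/* Model of saturating 16-bit addition (what QADD16 does per half
   on Cortex-M4): the sum clips to [-32768, 32767] instead of
   wrapping, which is what you want for audio. */
static inline int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s >  32767) s =  32767;
    if (s < -32768) s = -32768;
    return (int16_t)s;
}
```

Compare the wrapping behavior of a plain int16_t add: 30000 + 10000 wraps to a large negative value (a loud click), while the saturating version pins at 32767 (mild clipping distortion).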
The actual instructions are documented in the ARM v7-M Architecture Reference Manual (ARM document DDI0403D), starting on page 133, section A4.4.3. Probably the most interesting instructions are SMLALD and SMLALDX. Each performs two 16x16 signed multiplies and adds both products to a signed 64-bit accumulator. The 4 numbers to multiply have to be packed into 2 normal 32-bit ARM registers. SMLALD multiplies the lower halves together and the upper halves together; SMLALDX multiplies the lower half of one register with the upper half of the other, and vice versa. No other combinations are possible, so you must arrange your data appropriately if you want to get 2 multiply-accumulates in a single cycle. There is also a version that subtracts one of the products. There are also versions that accumulate to only 32 bits, which give you one extra precious 32-bit register, in cases where you're sure overflow isn't an issue (remember, these don't automatically saturate if your sum overflows).
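The SMLALD/SMLALDX semantics described above can be modeled in portable C (function names are mine; on Cortex-M4, each call below corresponds to a single one-cycle instruction):

```c
#include <stdint.h>

/* Portable models of SMLALD and SMLALDX. Each 32-bit operand packs
   two int16_t values (low half, high half); both 16x16 products are
   added to a 64-bit accumulator. */
static inline int64_t smlald_model(uint32_t a, uint32_t b, int64_t acc)
{
    int16_t a_lo = (int16_t)(a & 0xFFFFu), a_hi = (int16_t)(a >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFFu), b_hi = (int16_t)(b >> 16);
    /* lower halves together, upper halves together */
    return acc + (int32_t)a_lo * b_lo + (int32_t)a_hi * b_hi;
}

static inline int64_t smlaldx_model(uint32_t a, uint32_t b, int64_t acc)
{
    int16_t a_lo = (int16_t)(a & 0xFFFFu), a_hi = (int16_t)(a >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFFu), b_hi = (int16_t)(b >> 16);
    /* crossed: low of one register times high of the other */
    return acc + (int32_t)a_lo * b_hi + (int32_t)a_hi * b_lo;
}
```

An FIR inner loop would call smlald_model once per pair of taps, with samples and coefficients pre-packed two per word exactly as the post describes, which is why the data arrangement matters so much.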
Re: [music-dsp] LLVM or GCC for DSP Architectures
Thanks for the info, Paul. I've been considering using a Teensy for a DIY project for a while now. Just haven't found the time to start yet.

Shannon

On 12/9/2014 11:28 AM, Paul Stoffregen wrote: [...]