Re: [Freetel-codec2] arm fft F4 runs 6WS and 5WS

2016-09-17 Thread Danilo Beuche
Hi Glen, I would not worry too much:

- Maybe gcc 5.4 vs 4.9: the difference is ~-20% (depending on which end
you are looking from).  It is a lot, but not inexplicable.

- Maybe it is my test data. I don't know how much the kiss_fft run time
varies when different data is presented. I am running "artificially"
generated audio input (digitally captured codec2 frames from a single
750 Hz sine wave, also generated digitally).

- Maybe it is my strange way of running the mcHF firmware: the mcHF
hardware has a 16 MHz XO, but the discovery board which I have here for
testing has an 8 MHz XO. I didn't bother to reconfigure the PLL, so
everything takes twice the time. If the flash were asynchronously
coupled, which I doubt (otherwise there would be no need for explicit
wait state settings), this would have an influence, but I am quite sure
this is not the case. If the caches are asynchronous: maybe. I should
probably remeasure with a fixed PLL setup so that the processor runs at
a true 168 MHz. Will do that later and get back with updated numbers.
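
As a rough sketch of the clock arithmetic behind the last point: on the
STM32F4 the PLL-derived system clock is HSE / PLLM * PLLN / PLLP. The
divider values below are an assumption (a common choice for a 16 MHz
crystal), not values read from the mcHF sources.

```c
#include <stdint.h>

/* Hypothetical PLL dividers for a 16 MHz crystal; the real mcHF
 * settings may differ. */
typedef struct {
    uint32_t pllm;  /* input divider */
    uint32_t plln;  /* VCO multiplier */
    uint32_t pllp;  /* output divider for SYSCLK */
} pll_cfg_t;

/* SYSCLK = HSE / PLLM * PLLN / PLLP */
uint32_t sysclk_hz(uint32_t hse_hz, pll_cfg_t cfg)
{
    uint32_t vco_in = hse_hz / cfg.pllm;
    uint32_t vco_out = vco_in * cfg.plln;
    return vco_out / cfg.pllp;
}
```

With PLLM = 16, PLLN = 336, PLLP = 2 this gives 168 MHz from a 16 MHz
crystal and 84 MHz from an 8 MHz crystal, which matches "everything
takes twice the time".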

Danilo





On 18.09.2016 06:35, glen english wrote:
> Using environment Rowley CrossStudio for ARM 3.6.4 . GCC 4.9
>
> using cycle counter (yes)
>
> interrupt overhead : (irrelevant, most likely in my setup) (asm) irqs 
> only set off flags...
>
> for kissfft 5ws F4, I wonder why you have 112500 cycles and I have 
> 141000. Something for me to look at .
>
> hmm
>
> -O2, but I also have a bunch of debug symbol stuff in there. I think
> it is only symbol data at DB2, which pushes up the image size.


--
___
Freetel-codec2 mailing list
Freetel-codec2@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freetel-codec2


Re: [Freetel-codec2] arm fft F4 runs 6WS and 5WS

2016-09-17 Thread glen english
Using environment Rowley CrossStudio for ARM 3.6.4 . GCC 4.9

using cycle counter (yes)

interrupt overhead : (irrelevant, most likely in my setup) (asm) irqs 
only set off flags...

for kissfft 5ws F4, I wonder why you have 112500 cycles and I have 
141000. Something for me to look at .

hmm

-O2, but I also have a bunch of debug symbol stuff in there. I think
it is only symbol data at DB2, which pushes up the image size.






Re: [Freetel-codec2] arm fft F4 runs 6WS and 5WS

2016-09-17 Thread Danilo Beuche
Hi Glen,

what are you measuring, and how? I would like to repeat the
measurements in my environment.

We use the cycle counter of the Cortex processor (it seems you do the
same); interrupt overhead is removed (the counter is stopped during the
major interrupt, which is the audio IRQ).
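
A minimal sketch of that measurement scheme, under some assumptions: the
counter is a free-running 32-bit value (on target it would be read from
the Cortex-M DWT cycle counter, DWT->CYCCNT in CMSIS), and instead of
stopping the hardware counter the sketch pauses the accounting. The
names are hypothetical and the counter value is passed in so the logic
can be tried on a host.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t last;    /* counter value at the last start/resume */
    uint32_t total;   /* cycles accumulated while running */
    bool running;
} cyc_timer_t;

void cyc_start(cyc_timer_t *t, uint32_t now)
{
    t->last = now;
    t->total = 0;
    t->running = true;
}

/* Call on entry to the interrupt that should not be counted. */
void cyc_pause(cyc_timer_t *t, uint32_t now)
{
    if (t->running) {
        /* Unsigned subtraction stays correct across counter wraparound. */
        t->total += now - t->last;
        t->running = false;
    }
}

/* Call on exit from that interrupt. */
void cyc_resume(cyc_timer_t *t, uint32_t now)
{
    if (!t->running) {
        t->last = now;
        t->running = true;
    }
}

uint32_t cyc_stop(cyc_timer_t *t, uint32_t now)
{
    cyc_pause(t, now);
    return t->total;
}
```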

I have some prerecorded "frames" stored in flash which are fed into the
freedv_comprx routine. Measurements are taken every 50 frames, since
initially the decoder needs to lock onto the data.

After 50 frames I get stable measurements.

I did some hotspot analysis, and a single kiss_fft 512-point fft call
was taking around 670 µs (× 168 MHz), i.e. about 112500 cycles.

So this is in line with your numbers.

My results for kiss_fft vs arm_fft may not be correct, since in this
case I was in fact running kiss_fftr vs. arm_rfft_fast_f32.

Maybe there is a huge penalty for the arm_rfft_fast_f32 code.

Let me try the kiss_fft vs. arm_cfft case  to confirm your measurements.

Danilo



On 18.09.2016 06:07, glen english wrote:
> arm fft, stm32F405RGT6, 6WS
> -O2, debug level2.
>
> encode 1200 (40mS frame)
> 1745799 : 10.39mS
> decode
> 2292497 : 13.64mS
>
> so, 2x speed of my other runs
>
> let's run 5WS
>
> encode 1200 (40mS frame)
> 1579922 : 9.4mS
> decode
> 2073979 : 12.34mS
>
> WOW the wait states hurt on the M4 




Re: [Freetel-codec2] F4 runs.....

2016-09-17 Thread Danilo Beuche
Hi Glen,

difficult to say:

I have data for freedv_comprx() for FreeDV 1600 (which translates to
decode 1300, because some bits are used for data transmission). It does
a modem decode every 20 ms and a voice decode every 40 ms.

On average I have 12.2 ms per call, i.e. 24.4 ms per 40 ms frame. 2 x 5 ms
of that is the fdmdv_demod part, so decode 1300 is 14.4 ms.

I am using gcc 5.4, mostly -O2 (some files have -O3 enabled, but none of
the codec2 files).

And I run more or less the newest SVN code (codec2-dev 2875).

If you run somewhat older code, the numbers you gave match the
performance of a few days ago. As David mentioned, we gained about 40%
since r2842 (or so). So they could be right.

Danilo




On 18.09.2016 05:57, glen english wrote:
> kiss fft per standard codec2 code...
>
> -O2, debug level2.
> encode 3200 (20mS) frame
> encode : 1388964 cycles: 8.26mS
> decode : 1807440 cycles : 10.75mS
>
> encode 1200 (40mS frame)
> 3006497  (17.89mS)
> decode
> 3669321 (21.8mS)
>
> hmm seems pretty slow
> I wonder what I am doing wrong ?
> Or is that in line with others' measurements?
>
> kissFFT512 : 155,483 cycles (versus 70,000 on the M7)
>
> Next... arm asm on F4 (STM32F405RGT6)




[Freetel-codec2] arm fft F4 runs 6WS and 5WS

2016-09-17 Thread glen english
arm fft, stm32F405RGT6, 6WS

-O2, debug level2.

encode 1200 (40mS frame)
1745799 : 10.39mS
decode
2292497 : 13.64mS

so, 2x speed of my other runs

let's run 5WS

encode 1200 (40mS frame)
1579922 : 9.4mS
decode
2073979 : 12.34mS

WOW the wait states hurt on the M4 
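
For reference, the conversion behind the figures above can be sketched
as follows, assuming a 168 MHz core clock (the helper names are made up
for this illustration):

```c
#include <stdint.h>

#define CPU_HZ 168000000.0   /* core clock assumed for these runs */

/* Convert a cycle count to milliseconds at CPU_HZ. */
double cycles_to_ms(uint32_t cycles)
{
    return (double)cycles * 1000.0 / CPU_HZ;
}

/* Relative saving of `after` versus `before`, in percent. */
double saving_percent(uint32_t before, uint32_t after)
{
    return 100.0 * ((double)before - (double)after) / (double)before;
}
```

Applied to the numbers above, going from 6WS to 5WS saves roughly 9.5%
of the cycles for both encode (1745799 -> 1579922) and decode
(2292497 -> 2073979).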








[Freetel-codec2] F4 runs.....

2016-09-17 Thread glen english
kiss fft per standard codec2 code...

-O2, debug level2.
encode 3200 (20mS) frame
encode : 1388964 cycles: 8.26mS
decode : 1807440 cycles : 10.75mS

encode 1200 (40mS frame)
3006497  (17.89mS)
decode
3669321 (21.8mS)

hmm seems pretty slow
I wonder what I am doing wrong ?
Or is that in line with others' measurements?

kissFFT512 : 155,483 cycles (versus 70,000 on the M7)

Next... arm asm on F4 (STM32F405RGT6)






Re: [Freetel-codec2] more benching and thoughts

2016-09-17 Thread Danilo Beuche
Hi Glen,

On 18.09.2016 at 05:22, glen english wrote:
> Hi Danilo
>
> Yes, there is most-certainly a penalty for const access from flash, at
> least on the M4.
>
> and of course the instruction cache is no use here; it is just that,
> a cache only for instructions
Hence the name :-)
> I wonder what the bus matrix penalty is for data fetches from flash.
> There is an app-note about it somewhere I once read. It depends on what
> else is going on. The processor is pretty smart at interleaving the
> accesses so as not to stall the pipeline or bus matrix.
>
> As I am sure you know, Danilo (pointed out for others):
> you can force variables into sections (like forcing a static const) by
> using
>
> __attribute__((section("name"))) to assign them, say, into data.
Yes, indeed. We use that extensively in the mcHF firmware to move
certain parts to the CCM memory, which is otherwise not so easily
accessible (and should be used with care, as it does not support DMA
to/from peripherals). Works great.
> I will some time run up the code on an M4 and see what kiss-fft does.
>
> I am very very very surprised , and do not really believe that kissFFT
> is as fast as the arm assembler  on the M4 -  my immediate thoughts are
> "you are doing it wrong". so, I will investigate.
Would be happy if I am wrong. Which would mean we get even faster fft on 
the M4 almost for free (minus investigation time that is). No problem at 
all with me :-)


Danilo





[Freetel-codec2] memory management options

2016-09-17 Thread glen english
David
Have a look at

http://www.freertos.org/a00111.html

I use heap_2.c for most things... it works on the premise of repeated
requests and returns of things the same size.

heap_4 is good, also.
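
To illustrate the "same size" premise: the sketch below is a plain
fixed-size block pool, not the FreeRTOS heap_2 implementation, and all
names and sizes in it are hypothetical. It only shows why repeated
requests and returns of equal-sized blocks cannot fragment the memory:
every freed block fits the next request exactly.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCKS     4
#define POOL_BLOCK_SIZE 64

typedef union pool_block {
    union pool_block *next;            /* link while the block is free */
    uint8_t payload[POOL_BLOCK_SIZE];  /* user data while allocated */
} pool_block_t;

static pool_block_t pool_storage[POOL_BLOCKS];
static pool_block_t *pool_free_list;

/* Put every block on the free list. */
void pool_init(void)
{
    pool_free_list = NULL;
    for (size_t i = 0; i < POOL_BLOCKS; i++) {
        pool_storage[i].next = pool_free_list;
        pool_free_list = &pool_storage[i];
    }
}

/* Take one block, or NULL when the pool is exhausted. */
void *pool_alloc(void)
{
    pool_block_t *b = pool_free_list;
    if (b != NULL) {
        pool_free_list = b->next;
    }
    return b;
}

/* Return a block previously obtained from pool_alloc(). */
void pool_free(void *p)
{
    pool_block_t *b = p;
    b->next = pool_free_list;
    pool_free_list = b;
}
```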





Re: [Freetel-codec2] more benching and thoughts

2016-09-17 Thread glen english
Hi Danilo

Yes, there is most-certainly a penalty for const access from flash, at 
least on the M4.

and of course the instruction cache is no use here; it is just that,
a cache only for instructions

I wonder what the bus matrix penalty is for data fetches from flash.
There is an app note about it somewhere that I once read. It depends on
what else is going on. The processor is pretty smart at interleaving the
accesses so as not to stall the pipeline or bus matrix.

As I am sure you know, Danilo (pointed out for others):
you can force variables into sections (like forcing a static const) by
using

__attribute__((section("name"))) to assign them, say, into data.
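
A small sketch of how that can look in practice. The section name
".fast_ram" is hypothetical: it has to exist in the project's linker
script (mapped to SRAM or CCM, for example), and whether initial values
get copied there at startup depends on the linker script and startup
code. The table contents are only an example.

```c
#include <stddef.h>

/* Place an object in a named section on ARM GCC builds; expand to
 * nothing elsewhere so the code still builds on a host. */
#if defined(__GNUC__) && defined(__arm__)
#define FAST_RAM __attribute__((section(".fast_ram")))
#else
#define FAST_RAM
#endif

/* Deliberately not const: a const table would normally stay in flash.
 * Values are cos(k * pi / 8) for k = 0..7. */
FAST_RAM float twiddle_table[8] = {
    1.0f, 0.9239f, 0.7071f, 0.3827f, 0.0f, -0.3827f, -0.7071f, -0.9239f
};

float twiddle_lookup(size_t i)
{
    return twiddle_table[i % 8];
}
```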

I will some time run up the code on an M4 and see what kiss-fft does.

I am very very very surprised , and do not really believe that kissFFT 
is as fast as the arm assembler  on the M4 -  my immediate thoughts are 
"you are doing it wrong". so, I will investigate.

g









On 18/09/2016 12:13 PM, Danilo Beuche wrote:
> Hi Glen,
>
> just checked, it seems to me that we have all the caches running in the
> mcHF:
>
> https://github.com/df8oe/mchf-github/blob/670f94a2e69a55a03f099ad25390925c84c09201/mchf-eclipse/cmsis_boot/system_stm32f4xx.c#L424
>
> So even with these caches/buffers enabled, the M4 loses performance on
> data reads from flash. I haven't checked the manual/internet, but maybe
> the flash caching works well for code but not in the same way for data.
>
>
> Danilo
>
> On 18.09.2016 at 03:12, glen english wrote:
>> Hi Danilo
>>
>> Good thoughts and points.
>>
>> while on the RAM subject : required reading for every serious programmer :
>> "Memory"
>> https://lwn.net/Archives/GuestIndex/#Drepper_Ulrich
>>
>> read all 7 parts, 100 pages, but if you only have an hour, just read
>> "part 2 - cache"
>>
>> On the fft:
>>
>> I am not surprised that the ARM lib hand-optimized assembler is that
>> much faster,
>> more than 2x faster in fact.
>>
>> I don't think kiss-fft is particularly suitable for this sort of platform,
>> either, I'll hold back what I really think :-) .
>>
>> The 5WS on flash (actually 6WS I am running @ 168M) does not really
>> affect the performance too much.  In fact I can vary the WS count +/- 2
>> without much change- the ART and the prefetch and the instruction and
>> data caches are doing their job, so there is very little difference with
>> the const values in ram or cache.
>>
>> In fact, most  FFT implementations are very tough on a machine with cache .
>> Have you read the paper on how FFTW works ? It is very cache aware- and
>> adaptive to the architecture- that is why it does trial runs and picks
>> the best.
>>
>> The M7 is very impressive. It is certainly impressive work by ARM.
>>
>> However, the M4 is what all of you have to work with so we can stay
>> focussed on that.
>>
>> I think also the ram usage will be significantly less with the arm FFT
>> because of the re-entrant Kiss-fft behaviour.
>>
>> The M4 is quite a different beast, and having no D-cache can improve
>> performance relative to the M7 for some (ineptly) written applications
>> (not this one, but as a generalization for applications grabbing a byte
>> from memory randomly and all over a large dataset)
>>
>> Large matrix operations are where cache machines fall over- that is once
>> the dataset is bigger than the cache
>>
>> The question is how much optimization is enough. I am tempted NOT to
>> optimize any more, although I feel (just by looking at it )  I can get
>> another 2x out of it. Why- well there is no real pressing need.
>> Going too far away from the reference code will island the code a bit.
>> However, if you run out of modem cycles/ modem ram, then we can probably
>> get a bit more...
>>
>> cheers





Re: [Freetel-codec2] more benching and thoughts

2016-09-17 Thread glen english
Hi Danilo

Good thoughts and points.

while on the RAM subject : required reading for every serious programmer :
"Memory"
https://lwn.net/Archives/GuestIndex/#Drepper_Ulrich

read all 7 parts, 100 pages, but if you only have an hour, just read 
"part 2 - cache"

On the fft:

I am not surprised that the ARM lib hand-optimized assembler is that
much faster,
more than 2x faster in fact.

I don't think kiss-fft is particularly suitable for this sort of platform,
either, I'll hold back what I really think :-) .

The 5WS on flash (actually 6WS I am running @ 168M) does not really 
affect the performance too much.  In fact I can vary the WS count +/- 2 
without much change- the ART and the prefetch and the instruction and 
data caches are doing their job, so there is very little difference with 
the const values in ram or cache.

In fact, most  FFT implementations are very tough on a machine with cache .
Have you read the paper on how FFTW works ? It is very cache aware- and 
adaptive to the architecture- that is why it does trial runs and picks 
the best.

The M7 is very impressive. It is certainly impressive work by ARM.

However, the M4 is what all of you have to work with so we can stay 
focussed on that.

I think also the ram usage will be significantly less with the arm FFT 
because of the re-entrant Kiss-fft behaviour.

The M4 is quite a different beast, and having no D-cache can improve
performance relative to the M7 for some (ineptly) written applications
(not this one, but as a generalization for applications grabbing a byte
from memory randomly and all over a large dataset)

Large matrix operations are where cache machines fall over- that is once 
the dataset is bigger than the cache

The question is how much optimization is enough. I am tempted NOT to 
optimize any more, although I feel (just by looking at it )  I can get 
another 2x out of it. Why- well there is no real pressing need. 
Going too far away from the reference code will island the code a bit.
However, if you run out of modem cycles/ modem ram, then we can probably 
get a bit more...

cheers











Re: [Freetel-codec2] more benching and thoughts

2016-09-17 Thread Danilo Beuche

Hi,

I'd like to share a few thoughts/ideas as well:

Since we finished and removed most of the easy to remove performance 
hotspots with the exception of the kiss_fft calls which now contribute a 
major part to the overall runtime of modem and decoder, I played around 
with kiss_fft vs. arm DSP fft on the STM32F4 with some of the newer libs.


Turns out that, at least for the real fft (kiss_fftr vs.
arm_rfft_fast_f32), there is no time difference when used in our mcHF
code. How does this relate to the measurements of Glen (he measured
better performance)? I believe this is due to the fact that the arm lib
stores some of its data in precomputed flash arrays. Access to flash is
slow (5 wait states), so this reduces the performance. kiss has all its
data in RAM. Since Glen did initialize the arm DSP tables in RAM, he got
speed gains at the expense of RAM. On an STM32F746 this is not as much
of an issue (384K) as it is on the STM32F4 (the default MCU in the mcHF
project has 192K RAM, and we have to fit the full SDR firmware RAM needs
into this space). Speed is traded for RAM use reduction. Since we have
reached our goal timewise, memory reduction is now more in focus for us
(but it should not get slower, of course).


Because of that I would like to propose the following approach to keep 
the code easily readable while providing efficient solutions for the ARM 
MCUs both with little and more RAM:


1. We create an abstract interface for running fft in the codec2 
sources. Initially this should closely resemble the existing kiss_fft 
calls, which makes introducing the interface easy, since we may
use all the existing tests and can verify that introducing the interface 
does not change a bit in the output.


2. Once validated, we can now introduce/activate the use of the arm DSP 
FFT with some glue code to map between the abstract interface and the 
arm DSP interface. Here again we have to validate everything is working 
nicely but we will see some slight differences I assume. However, with 
it we can produce reference data for step 3


3. Now we modify the existing code so that we can benefit from some nice 
properties of the arm DSP fft (inplace FFT) which means this will reduce 
RAM usage significantly (in relation to 192K RAM).


4. We enable optional use of RAM instead of flash in the ARM code, so 
that depending on the amount of available memory you can get some extra 
boost.


For that to work nicely, we have to fix some issues in the existing code 
first, so here comes


0. As Glen pointed out, some of the #define constants have not so good
names. Especially M (defined for 2 different purposes in defines.h and
fdmdv_internal.h) is nasty, and also N in defines.h (there are some
local variables named N, and the STM headers also get confused by it).
So we need to change these to something unambiguous. I think Glen
already suggested names for them.
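
A short illustration of the naming problem in step 0. The replacement
names and the values below are placeholders for this sketch only; they
are not the names suggested on the list, and the values are not taken
from the codec2 headers.

```c
/* Short macro names collide with ordinary identifiers:
 *
 *     #define N 80
 *     int sum_first(const int *x, int N);   expands to "int 80" and
 *                                           no longer compiles
 *
 * Prefixed, descriptive names avoid the clash and also make the two
 * different meanings of M distinguishable. */
#define C2_SAMPLES_PER_FRAME  80    /* stands in for N in defines.h */
#define C2_PITCH_WINDOW       320   /* stands in for M in defines.h */
#define FDMDV_OVERSAMPLE      160   /* stands in for M in fdmdv_internal.h */

/* A parameter or local variable called N is now harmless. */
int sum_first(const int *x, int N)
{
    int acc = 0;
    for (int i = 0; i < N; i++) {
        acc += x[i];
    }
    return acc;
}
```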



And I would like to point out, that the use of dynamic memory allocation 
(malloc/free) is necessary in our mcHF case, so I would like to keep 
this more or less as it is. The mcHF needs the ability to reuse the 
memory for other operational modes, if FreeDV is not active. Which does 
not mean I am against removing the internal use of malloc, but then it 
should be possible to easily create the required data structures 
"outside" the code using malloc. I.e. the use of static data structures 
for anything but const data is a no go.
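
One way to read "create the required data structures outside the code"
is the pattern below: the library reports how much memory it needs and
initialises whatever block the caller hands in. This is only a sketch
with hypothetical names; the real codec2 state structures are larger.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical decoder state used only for this illustration. */
typedef struct {
    int   frames_seen;
    float history[16];
} demo_state_t;

/* Library side: report the memory requirement... */
size_t demo_state_size(void)
{
    return sizeof(demo_state_t);
}

/* ...and initialise memory the caller obtained however it likes. */
demo_state_t *demo_state_init(void *mem)
{
    if (mem == NULL) {
        return NULL;
    }
    memset(mem, 0, sizeof(demo_state_t));
    return (demo_state_t *)mem;
}

/* Application side: take the memory from malloc while the mode is
 * active and hand it back afterwards so other modes can reuse it. */
int demo_run_once(void)
{
    void *mem = malloc(demo_state_size());
    demo_state_t *st = demo_state_init(mem);
    if (st == NULL) {
        return -1;
    }
    st->frames_seen++;
    int n = st->frames_seen;
    free(mem);
    return n;
}
```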


To support that discussion I created a draft suggestion for the interface 
(attached to this mail). It is right now defined using inline code for 
the sake of simplicity. This may change later, I don't think there 
should be any issue with that. It essentially contains 3 functions for 
complex fft and real fft each (alloc,fft,free) and the necessary data 
structures.
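
The attachment is not part of this archive, so the following is only a
guess at the general shape of such an interface (alloc, fft, free), with
hypothetical names throughout. As a stand-in backend it uses a naive
8-point DFT driven by a precomputed table; a real build would map these
calls onto kiss_fft or the arm DSP library instead. Only the complex
variant is shown.

```c
#include <stdlib.h>

typedef struct { float real, imag; } fftx_cpx;

typedef struct {
    int nfft;
    int inverse;
} fftx_state, *fftx_cfg;

/* cos(2*pi*m/8) for m = 0..7; sin(2*pi*m/8) is cos8[(m + 6) % 8]. */
static const float cos8[8] = {
    1.0f, 0.70710678f, 0.0f, -0.70710678f,
    -1.0f, -0.70710678f, 0.0f, 0.70710678f
};

fftx_cfg fftx_alloc(int nfft, int inverse)
{
    if (nfft != 8) {
        return NULL;            /* sketch: only the table size is supported */
    }
    fftx_cfg cfg = malloc(sizeof(*cfg));
    if (cfg != NULL) {
        cfg->nfft = nfft;
        cfg->inverse = inverse;
    }
    return cfg;
}

/* Naive O(N^2) DFT; the inverse is left unscaled. */
void fftx(fftx_cfg cfg, const fftx_cpx *in, fftx_cpx *out)
{
    for (int k = 0; k < cfg->nfft; k++) {
        float sr = 0.0f, si = 0.0f;
        for (int n = 0; n < cfg->nfft; n++) {
            int m = (n * k) % 8;
            float c = cos8[m];
            float s = cos8[(m + 6) % 8];
            if (!cfg->inverse) {
                s = -s;         /* forward transform: exp(-j*2*pi*n*k/N) */
            }
            sr += in[n].real * c - in[n].imag * s;
            si += in[n].real * s + in[n].imag * c;
        }
        out[k].real = sr;
        out[k].imag = si;
    }
}

void fftx_free(fftx_cfg cfg)
{
    free(cfg);
}
```

Because callers only see the three functions and the opaque handle, the
backend can be swapped without touching the codec code, which is the
point of step 1.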


Danilo




On 16.09.2016 at 20:59, Dana Myers wrote:



Subject:Re: [Freetel-codec2] more benching and thoughts
Date:   Fri, 16 Sep 2016 12:13:05 +1000
From:   glen english 
Reply-To:   freetel-codec2@lists.sourceforge.net
To: freetel-codec2@lists.sourceforge.net



Hi Danilo
Yeah, I guess being a very bare metal programmer from the old 128 byte
RAM days, I dislike MALLOCs in embedded code on principle.


I'm similar, though now that we have 32kB, 64kB (or even more) RAM
in embedded chips, they're basically like the systems that malloc()
was initially built on :-)

I still don't trust C++ heap allocators in embedded applications, though.


However, because the heap usage would be deterministic, it should be
fairly safe.
I have a similar project; a 1200 baud modem + TNC stack built in a
PSoC 5LP. I use the Delta-Sigma ADC @ 9600s/s, 16-bits, the PSoC Digital
Filter Block to do bandpass heavy-lifting and frequency response
correction for the ADC. I use CMSIS-DSP for the rest of the DSP
crunching required, and, wait for it, wrap the whole thing in FreeRTOS
(9.0.0 now). I use q31_t for all the DSP,
as long as I'm 

Re: [Freetel-codec2] cascaded ulaw, Alaw and,AMBE etc

2016-09-17 Thread Stuart Longland
On 17/09/16 18:28, glen english wrote:
> Yeah.
> 
> Cascading codecs is always trouble.
> 
> BTW my understanding of the word TRANSCODING is going between one encoded
> method and another without going all the way back to uncompressed. Going
> between different video encoding methods is usually done by transcoding.
> Most video encoding is DCT / macroblock based, so there is probably more
> of a relationship between video codecs than between speech codec variations.
> 
> 
> In this problematic voip case:  uncoded PCM (microphone)  >> AMBE2 >> 
> over air >> decoded >> PCM >> uLaw encode>> ulaw decode >> headset. (and 
> reverse) .

Transcoding is going from one format to another, and when the
destination is a lossy format, that will mean a reduction in quality.

Decoding to uncompressed shouldn't incur loss compared with a
hypothetical conversion from AMBE2 direct to µ-law, more likely the
assumptions made by the µ-law CODEC don't hold true for the synthesized
voice from the AMBE2 CODEC, and that would be why it sounds so terrible.
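
To make the lossy step concrete, here is a sketch of the widely used
16-bit µ-law companding routine (in the style of G.711; the function
names are made up for this example). It says nothing about AMBE2
itself; it only shows that the µ-law stage on its own quantises the
signal, more coarsely at higher amplitudes.

```c
#include <stdint.h>

#define ULAW_BIAS 0x84     /* 132, added before the segment search */
#define ULAW_CLIP 32635    /* largest magnitude that can be encoded */

/* Compress one linear PCM sample to an 8-bit mu-law code. */
uint8_t ulaw_encode(int32_t pcm)
{
    int sign = 0;
    if (pcm < 0) {
        pcm = -pcm;
        sign = 0x80;
    }
    if (pcm > ULAW_CLIP) {
        pcm = ULAW_CLIP;
    }
    pcm += ULAW_BIAS;

    /* Find the segment: position of the highest set bit, 7 down to 0. */
    int exponent = 7;
    for (int32_t mask = 0x4000; (pcm & mask) == 0 && exponent > 0; mask >>= 1) {
        exponent--;
    }

    int mantissa = (pcm >> (exponent + 3)) & 0x0F;
    return (uint8_t)~(sign | (exponent << 4) | mantissa);
}

/* Expand an 8-bit mu-law code back to linear PCM. */
int32_t ulaw_decode(uint8_t code)
{
    code = (uint8_t)~code;
    int exponent = (code >> 4) & 0x07;
    int mantissa = code & 0x0F;
    int32_t mag = (((int32_t)mantissa << 3) + ULAW_BIAS) << exponent;
    mag -= ULAW_BIAS;
    return (code & 0x80) ? -mag : mag;
}
```

For example, a sample value of 1000 comes back as 988 after one
encode/decode round trip, while a second pass through the same routine
leaves 988 unchanged.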

Regards,
-- 
Stuart Longland (aka Redhatter, VK4MSL)

I haven't lost my mind...
  ...it's backed up on a tape somewhere.
