Re: [flac-dev] About SSE intrinsincs in decoder

2022-07-08 Thread Martijn van Beurden
On Thu, 7 Jul 2022 at 09:46, olivier tristan wrote:

> > Perhaps it is possible to add a switch to the encoder to create FLAC files
> > that are optimized for decoding speed instead of size. Would that be
> > something you would use? For example, trading 5% less compression for
> > 30% more decoding speed, assuming that MD5 checking is already off?
>
> This would indeed be interesting.
>

I'll keep that in mind for the future. Thanks for explaining.
___
flac-dev mailing list
flac-dev@xiph.org
http://lists.xiph.org/mailman/listinfo/flac-dev


Re: [flac-dev] About SSE intrinsincs in decoder

2022-07-07 Thread olivier tristan

On 7 Jul 2022 at 09:34, Martijn van Beurden wrote:

> On Thu, 7 Jul 2022 at 09:07, olivier tristan wrote:
> > Hence even small optimizations are very welcome :)
>
> I presume you use libFLAC directly then. Sadly there is little left to
> optimize in the decoder. Below is an excerpt of the output of gprof on
> flac decoding a track:


>  %   cumulative   self              self     total
> time   seconds   seconds    calls   s/call   s/call  name
> 34.87      0.68     0.68   680925     0.00     0.00  FLAC__bitreader_read_rice_signed_block
> 25.64      1.18     0.50  6004826     0.00     0.00  FLAC__MD5Transform
> 14.36      1.46     0.28    46030     0.00     0.00  FLAC__lpc_restore_signal
>  8.72      1.63     0.17    23457     0.00     0.00  read_frame_
>  5.13      1.73     0.10    23457     0.00     0.00  write_callback
>  3.08      1.79     0.06    23457     0.00     0.00  FLAC__MD5Accumulate
>  3.08      1.85     0.06                             read
>  2.56      1.90     0.05    50901     0.00     0.00  FLAC__crc16_update_words32
>  1.03      1.92     0.02    23457     0.00     0.00  write_audio_frame_to_client_
>  0.51      1.93     0.01  2016520     0.00     0.00  bitreader_read_from_client_
>  0.51      1.94     0.01                             _IO_file_seekoff
>  0.51      1.95     0.01                             write

> As you can see, the bitreader takes up the most time. This is, however,
> not something that can be optimized with SIMD/vector instructions like
> SSE, AVX, NEON etc. It is also a strictly sequential process. In the
> past there have been several attempts at improving the speed of this
> call. You could try for yourself whether configuring with ./configure
> --enable-64-bit-words or cmake -DENABLE_64_BIT_WORDS=ON brings any
> (small) improvement.
>
> Next the MD5 transformation takes up a lot of time too, but I suppose
> you do not use that anyway. It is disabled by default when decoding
> using libFLAC directly.
>
> Finally the lpc restore takes up some time and can be improved with
> SSE, AVX, NEON etc., but it represents only a small part of the
> decoding CPU load.




We do use libFLAC directly, so MD5 checking is not enabled in my case.

We do indeed see FLAC__bitreader_read_rice_signed_block and
FLAC__lpc_restore_signal in our profiler.

> Perhaps it is possible to add a switch to the encoder to create FLAC
> files that are optimized for decoding speed instead of size. Would
> that be something you would use? For example, trading 5% less
> compression for 30% more decoding speed, assuming that MD5
> checking is already off?

This would indeed be interesting.

The material we use compresses very well with FLAC, as each file is just a
single note of an instrument as opposed to a song.

For example, in a piano library, we can divide the sample size by 4.


--
Olivier Tristan
Research & Development
www.uvi.net


Re: [flac-dev] About SSE intrinsincs in decoder

2022-07-07 Thread Martijn van Beurden
On Thu, 7 Jul 2022 at 09:07, olivier tristan wrote:
> Hence even small optimizations are very welcome :)

I presume you use libFLAC directly then. Sadly there is little left to
optimize in the decoder. Below is an excerpt of the output of gprof on flac
decoding a track:

>  %   cumulative   self              self     total
> time   seconds   seconds    calls   s/call   s/call  name
> 34.87      0.68     0.68   680925     0.00     0.00  FLAC__bitreader_read_rice_signed_block
> 25.64      1.18     0.50  6004826     0.00     0.00  FLAC__MD5Transform
> 14.36      1.46     0.28    46030     0.00     0.00  FLAC__lpc_restore_signal
>  8.72      1.63     0.17    23457     0.00     0.00  read_frame_
>  5.13      1.73     0.10    23457     0.00     0.00  write_callback
>  3.08      1.79     0.06    23457     0.00     0.00  FLAC__MD5Accumulate
>  3.08      1.85     0.06                             read
>  2.56      1.90     0.05    50901     0.00     0.00  FLAC__crc16_update_words32
>  1.03      1.92     0.02    23457     0.00     0.00  write_audio_frame_to_client_
>  0.51      1.93     0.01  2016520     0.00     0.00  bitreader_read_from_client_
>  0.51      1.94     0.01                             _IO_file_seekoff
>  0.51      1.95     0.01                             write

As you can see, the bitreader takes up the most time. This is, however, not
something that can be optimized with SIMD/vector instructions like SSE,
AVX, NEON etc. It is also a strictly sequential process. In the past there
have been several attempts at improving the speed of this call. You could
try for yourself whether configuring with ./configure --enable-64-bit-words
or cmake -DENABLE_64_BIT_WORDS=ON brings any (small) improvement.
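
To illustrate why Rice decoding resists vectorization, here is a minimal
sketch of a Rice-coded residual reader. It is not libFLAC's actual
FLAC__bitreader_read_rice_signed_block: the BitReader struct and all function
names are hypothetical, and it reads one bit at a time for clarity. The point
it shows is that each code word has a variable length, so the position of the
next value is only known after the current one has been decoded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal MSB-first bit reader (not libFLAC's FLAC__BitReader). */
typedef struct {
    const uint8_t *buf;
    size_t nbits;   /* number of valid bits in buf */
    size_t bitpos;  /* index of the next unread bit */
} BitReader;

static unsigned read_bit(BitReader *br)
{
    assert(br->bitpos < br->nbits);
    unsigned bit = (br->buf[br->bitpos >> 3] >> (7 - (br->bitpos & 7))) & 1u;
    br->bitpos++;
    return bit;
}

/* Decode one Rice-coded signed value with Rice parameter k. */
static int32_t read_rice_signed(BitReader *br, unsigned k)
{
    assert(k < 31);
    uint32_t u = 0;
    while (read_bit(br) == 0)  /* unary part: count zero bits up to the stop bit */
        u++;
    for (unsigned i = 0; i < k; i++)  /* binary part: k low-order bits */
        u = (u << 1) | read_bit(br);
    /* Undo the zigzag folding: 0, 1, 2, 3, ... maps to 0, -1, 1, -2, ... */
    return (int32_t)(u >> 1) ^ -(int32_t)(u & 1);
}

/* Decode n values. Iteration i cannot start before iteration i-1 has
 * finished, because only then is br->bitpos known. */
static void read_rice_block(BitReader *br, int32_t *out, size_t n, unsigned k)
{
    for (size_t i = 0; i < n; i++)
        out[i] = read_rice_signed(br, k);
}
```

A real decoder works on whole 32-bit or 64-bit words and uses a
count-leading-zeros operation for the unary part, which is presumably where a
wider word size could help a little, but the dependency between consecutive
values remains.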

Next the MD5 transformation takes up a lot of time too, but I suppose you
do not use that anyway. It is disabled by default when decoding using
libFLAC directly.
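
As a rough worked example of what the profile looks like with MD5 out of the
picture, the sketch below recomputes the shares from the self-seconds column
of the gprof excerpt. It assumes (and this is only an assumption) that the
remaining rows stay unchanged when MD5 checking is off. Under that assumption
the total drops from 1.95 s to 1.39 s, the bitreader accounts for about 49%
and the lpc restore for about 20%. The struct and function names are made up
for this illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *name;
    double self_seconds;  /* "self seconds" column of the gprof excerpt */
    int is_md5;           /* 1 for rows that vanish when MD5 checking is off */
} ProfileRow;

static const ProfileRow PROFILE[] = {
    { "FLAC__bitreader_read_rice_signed_block", 0.68, 0 },
    { "FLAC__MD5Transform",                     0.50, 1 },
    { "FLAC__lpc_restore_signal",               0.28, 0 },
    { "read_frame_",                            0.17, 0 },
    { "write_callback",                         0.10, 0 },
    { "FLAC__MD5Accumulate",                    0.06, 1 },
    { "read",                                   0.06, 0 },
    { "FLAC__crc16_update_words32",             0.05, 0 },
    { "write_audio_frame_to_client_",           0.02, 0 },
    { "bitreader_read_from_client_",            0.01, 0 },
    { "_IO_file_seekoff",                       0.01, 0 },
    { "write",                                  0.01, 0 },
};
static const size_t PROFILE_LEN = sizeof PROFILE / sizeof PROFILE[0];

/* Sum of self seconds, optionally leaving out the MD5 rows. */
static double total_seconds(int include_md5)
{
    double total = 0.0;
    for (size_t i = 0; i < PROFILE_LEN; i++)
        if (include_md5 || !PROFILE[i].is_md5)
            total += PROFILE[i].self_seconds;
    return total;
}

/* Fraction of the non-MD5 total spent in row `index`. */
static double share_without_md5(size_t index)
{
    assert(index < PROFILE_LEN && !PROFILE[index].is_md5);
    return PROFILE[index].self_seconds / total_seconds(0);
}
```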

Finally the lpc restore takes up some time and can be improved with SSE,
AVX, NEON etc., but it represents only a small part of the decoding CPU
load.
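
For reference, the kind of loop meant by "lpc restore" can be sketched as
follows. This is a simplified illustration, not libFLAC's
FLAC__lpc_restore_signal: the function name, the parameter names and the
buffer layout (warm-up samples stored at the start of the output array) are
choices made for this example. The inner dot product is the part that SIMD or
an auto-vectorizing compiler can speed up; the outer loop stays serial because
every restored sample feeds the next prediction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Restore n samples from their prediction residual.
 *
 * samples[0 .. order-1] must already hold the warm-up samples; the restored
 * signal is written to samples[order .. order+n-1].
 * coeff[0] applies to the most recent sample, coeff[1] to the one before, etc.
 * shift is the quantization shift applied to the prediction.
 */
static void lpc_restore(const int32_t *residual, size_t n,
                        const int32_t *coeff, size_t order,
                        int shift, int32_t *samples)
{
    assert(shift >= 0 && shift < 32);
    for (size_t i = 0; i < n; i++) {
        int64_t sum = 0;
        /* Dot product over the `order` most recent samples. */
        for (size_t j = 0; j < order; j++)
            sum += (int64_t)coeff[j] * samples[order + i - 1 - j];
        /* Right-shifting a negative value is implementation-defined in C;
         * this sketch assumes the usual arithmetic shift. */
        samples[order + i] = residual[i] + (int32_t)(sum >> shift);
    }
}
```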

Perhaps it is possible to add a switch to the encoder to create FLAC files
that are optimized for decoding speed instead of size. Would that be
something you would use? For example, trading 5% less compression for
30% more decoding speed, assuming that MD5 checking is already off?


Re: [flac-dev] About SSE intrinsincs in decoder

2022-07-07 Thread olivier tristan
I work on virtual music instruments ( uvi.net ) and we store all the
samples in FLAC format because of its compression level, its lossless
capability and its decoding speed. This is clearly the best in the world
and we love it. We have a streaming engine tailored to its use.

In the case of a piano library, we open around 10,000 files at loading
time and read hundreds of ms from the beginning of each file as a preload,
to avoid latency when the user starts playing.

Then, depending on what the user plays, we may read around 400 files at
the same time, and even more in our new engine, which multithreads the
audio rendering and can read up to 2000 files.

Hence even small optimizations are very welcome :)

I don't mind losing the 32-bit x86 optimizations, as we don't have users
on 32-bit Intel platforms.

I wouldn't even mind paying someone for better optimization, in NEON for
example.

Hope this clarifies things.

On 6 Jul 2022 at 20:36, Martijn van Beurden wrote:

> Olivier,
>
> On a more general note, do you experience the decoding speed of
> libFLAC as a bottleneck in your application? Decoding speed of FLAC is
> already best-in-class among lossless audio codecs, so I actually
> wasn't expecting anyone to object to a (small) decrease in decoding
> speed. There have been more changes recently that could (slightly)
> affect decoding speed, but for good reasons. I am aware that the
> following changes affect decoding speed:
> - https://github.com/xiph/flac/commit/1bec35e33757fc38261b0acfa3c032e720d2baf0
> - https://github.com/xiph/flac/commit/63ac1c37bebbda5ca61ad5a05a1d8fba2883f629
>
> Comparing current git to release 1.3.4 for my 64-bit x86 machine this
> adds up to about 4% slowdown for both 16-bit and 24-bit audio. When
> using a 32-bit compile, for 16-bit audio this cancels out against the
> speed gain from the removal of optimizations and for 24-bit audio this
> compounds to about 10% speed loss. Numbers may vary depending on the
> CPU specifics. I didn't think this would be a problem, but please
> speak up (and if possible elaborate on the details) if it is.


--
Olivier Tristan
Research & Development
www.uvi.net



Re: [flac-dev] About SSE intrinsincs in decoder

2022-07-06 Thread Martijn van Beurden
On Wed, 6 Jul 2022 at 20:36, Martijn van Beurden wrote:
>
> I am aware that the following changes affect decoding speed
> - https://github.com/xiph/flac/commit/1bec35e33757fc38261b0acfa3c032e720d2baf0
> - https://github.com/xiph/flac/commit/63ac1c37bebbda5ca61ad5a05a1d8fba2883f629
>

Sorry, small mistake in the last email. The correct commits are:
- https://github.com/xiph/flac/commit/1bec35e33757fc38261b0acfa3c032e720d2baf0
- https://github.com/xiph/flac/commit/1793632ee6988deb933ff7551fee92134622b558


Re: [flac-dev] About SSE intrinsincs in decoder

2022-07-06 Thread Martijn van Beurden
Olivier,

On a more general note, do you experience the decoding speed of libFLAC as
a bottleneck in your application? Decoding speed of FLAC is already
best-in-class among lossless audio codecs, so I actually wasn't expecting
anyone to object to a (small) decrease in decoding speed. There have been
more changes recently that could (slightly) affect decoding speed, but for
good reasons. I am aware that the following changes affect decoding speed:
- https://github.com/xiph/flac/commit/1bec35e33757fc38261b0acfa3c032e720d2baf0
- https://github.com/xiph/flac/commit/63ac1c37bebbda5ca61ad5a05a1d8fba2883f629

Comparing current git to release 1.3.4 for my 64-bit x86 machine this adds
up to about 4% slowdown for both 16-bit and 24-bit audio. When using a
32-bit compile, for 16-bit audio this cancels out against the speed gain
from the removal of optimizations and for 24-bit audio this compounds to
about 10% speed loss. Numbers may vary depending on the CPU specifics. I
didn't think this would be a problem, but please speak up (and if possible
elaborate on the details) if it is.


Re: [flac-dev] About SSE intrinsincs in decoder

2022-07-05 Thread Martijn van Beurden
On Tue, 5 Jul 2022 at 09:41, olivier tristan wrote:

> You do not mention the SSE4.1 version in your benchmark.
>
> Have you tried this use case?
>

I compared 4 compiles: one without any changes (so with all variants of the
lpc functions, including the SSE4.1 ones) and three with variants of plain
C code. As both CPUs that were tested support SSE4.1, the SSE4.1 functions
were part of the comparison. So yes, current GCC outperforms those SSE4.1
intrinsics functions on 16-bit inputs and comes close on 24-bit inputs.


Re: [flac-dev] About SSE intrinsincs in decoder

2022-07-05 Thread olivier tristan

You do not mention the SSE4.1 version in your benchmark.

Have you tried this use case?

Thanks !

On 4 Jul 2022 at 19:23, Martijn van Beurden wrote:

> On Mon, 4 Jul 2022 at 15:06, olivier tristan wrote:
>
> > While I can understand the rationale for removing the manual assembly,
> > as 32-bit x86 is dead, it seems a much bigger deal to remove all
> > optimizations, including the intrinsics ones.
>
> Yes, it does seem a great deal to remove all optimization, but it
> really isn't. See the pull request associated with that change for
> more information: https://github.com/xiph/flac/pull/347. I did quite a
> bit of testing before merging this change, on two different CPUs, each
> with 3 different compilers, each with 4 variants of the
> non-intrinsics-accelerated functions. It turns out that there is no
> performance loss at all, and in many cases this change makes flac
> actually faster, not slower as one would expect.
>
> > Maybe there should be an opt-in if you don't want them included by
> > default, but some people, including me, don't want to see those
> > optimizations removed?
>
> There would be no advantage of that over keeping the original code: it
> still needs to be maintained and tested, even if it is hidden behind
> some configuration option. The only case where this patch could be
> problematic in terms of speed is when one compiles flac to be used on
> CPUs that do not support SSE2.


--
Olivier Tristan
Research & Development
www.uvi.net


Re: [flac-dev] About SSE intrinsincs in decoder

2022-07-04 Thread Martijn van Beurden
On Mon, 4 Jul 2022 at 15:06, olivier tristan wrote:

> While I can understand the rationale for removing the manual assembly,
> as 32-bit x86 is dead, it seems a much bigger deal to remove all
> optimizations, including the intrinsics ones.
>

Yes, it does seem a great deal to remove all optimization, but it really
isn't. See the pull request associated with that change for more
information: https://github.com/xiph/flac/pull/347. I did quite a bit of
testing before merging this change, on two different CPUs, each with 3
different compilers, each with 4 variants of the non-intrinsics-accelerated
functions. It turns out that there is no performance loss at all, and in
many cases this change makes flac actually faster, not slower as one would
expect.


> Maybe there should be an opt-in if you don't want them included by
> default, but some people, including me, don't want to see those
> optimizations removed?
>

There would be no advantage of that over keeping the original code: it
still needs to be maintained and tested, even if it is hidden behind some
configuration option. The only case where this patch could be problematic
in terms of speed is when one compiles flac to be used on CPUs that do not
support SSE2.