Re: [flac-dev] About SSE intrinsincs in decoder
On Thu, 7 Jul 2022 at 09:46, olivier tristan wrote:

> > Perhaps it is possible to add a switch to the encoder to create FLAC
> > files that are optimized for decoding speed instead of size. Would that
> > be something you would use? For example, accepting 5% less compression
> > in exchange for 30% more decoding speed, assuming that MD5 checking is
> > already off?
>
> This would indeed be interesting.

I'll keep that in mind for the future. Thanks for explaining.
___
flac-dev mailing list
flac-dev@xiph.org
http://lists.xiph.org/mailman/listinfo/flac-dev
Re: [flac-dev] About SSE intrinsincs in decoder
On Thu, 7 Jul 2022 at 09:34, Martijn van Beurden wrote:

> On Thu, 7 Jul 2022 at 09:07, olivier tristan wrote:
>
> > Hence even small optimizations are very welcome :)
>
> I presume you use libFLAC directly then. Sadly there is little left to
> optimize in the decoder. Below is an excerpt of the output of gprof on
> flac decoding a track
>
> >   %   cumulative   self              self     total
> >  time   seconds   seconds    calls  s/call   s/call  name
> > 34.87      0.68     0.68   680925     0.00     0.00  FLAC__bitreader_read_rice_signed_block
> > 25.64      1.18     0.50  6004826     0.00     0.00  FLAC__MD5Transform
> > 14.36      1.46     0.28    46030     0.00     0.00  FLAC__lpc_restore_signal
> >  8.72      1.63     0.17    23457     0.00     0.00  read_frame_
> >  5.13      1.73     0.10    23457     0.00     0.00  write_callback
> >  3.08      1.79     0.06    23457     0.00     0.00  FLAC__MD5Accumulate
> >  3.08      1.85     0.06                             read
> >  2.56      1.90     0.05    50901     0.00     0.00  FLAC__crc16_update_words32
> >  1.03      1.92     0.02    23457     0.00     0.00  write_audio_frame_to_client_
> >  0.51      1.93     0.01  2016520     0.00     0.00  bitreader_read_from_client_
> >  0.51      1.94     0.01                             _IO_file_seekoff
> >  0.51      1.95     0.01                             write
>
> As you can see, the bitreader takes up most time. This is however not
> something that can be optimized with SIMD/vector instructions like SSE,
> AVX, NEON etc. It is also strictly a sequential process. In the past
> there have been several attempts at improving speed of this call. You
> could try for yourself whether configuring with
> ./configure --enable-64-bit-words or cmake -DENABLE_64_BIT_WORDS=ON
> brings any (small) improvement.
>
> Next the MD5 transformation takes up a lot of time too, but I suppose
> you do not use that anyway. It is disabled by default when decoding
> using libFLAC directly.
>
> Finally the lpc restore takes up some time and can be improved with SSE,
> AVX, NEON etc., but it represents only a small part of the decoding CPU
> load.

We do use libFLAC directly, so MD5 is not enabled in my case.
In the profiler we do indeed see FLAC__bitreader_read_rice_signed_block
and FLAC__lpc_restore_signal.

> Perhaps it is possible to add a switch to the encoder to create FLAC
> files that are optimized for decoding speed instead of size. Would that
> be something you would use? For example, accepting 5% less compression
> in exchange for 30% more decoding speed, assuming that MD5 checking is
> already off?

This would indeed be interesting.
The material we use is compressed very well by FLAC, since each file is
just a single note of an instrument rather than a song. For example, in a
piano library FLAC divides the sample size by 4.

--
Olivier Tristan
Research & Development
www.uvi.net
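As a rough illustration of what the quoted profile implies once MD5 is off, the small C sketch below recomputes each function's share of the self time with the two MD5 rows left out. The numbers are the self-seconds column of the gprof excerpt; the type and function names (profile_row, profile_total, profile_share) are made up for this example. It assumes that dropping MD5 leaves the other timings unchanged, which is only an approximation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the gprof flat profile quoted above (self seconds only). */
typedef struct {
    const char *name;
    double self_seconds;
    int is_md5;            /* 1 for the rows that vanish when MD5 checking is off */
} profile_row;

static const profile_row profile[] = {
    {"FLAC__bitreader_read_rice_signed_block", 0.68, 0},
    {"FLAC__MD5Transform",                     0.50, 1},
    {"FLAC__lpc_restore_signal",               0.28, 0},
    {"read_frame_",                            0.17, 0},
    {"write_callback",                         0.10, 0},
    {"FLAC__MD5Accumulate",                    0.06, 1},
    {"read",                                   0.06, 0},
    {"FLAC__crc16_update_words32",             0.05, 0},
    {"write_audio_frame_to_client_",           0.02, 0},
    {"bitreader_read_from_client_",            0.01, 0},
    {"_IO_file_seekoff",                       0.01, 0},
    {"write",                                  0.01, 0},
};
static const size_t profile_len = sizeof profile / sizeof profile[0];

/* Sum of self seconds, optionally skipping the MD5 rows. */
static double profile_total(int include_md5)
{
    double total = 0.0;
    for (size_t i = 0; i < profile_len; i++)
        if (include_md5 || !profile[i].is_md5)
            total += profile[i].self_seconds;
    return total;
}

/* Fraction of the (optionally MD5-free) total spent in one function. */
static double profile_share(const char *name, int include_md5)
{
    double total = profile_total(include_md5);
    assert(total > 0.0);
    for (size_t i = 0; i < profile_len; i++)
        if (strcmp(profile[i].name, name) == 0)
            return profile[i].self_seconds / total;
    return 0.0;            /* unknown name */
}
```

With the MD5 rows excluded, the Rice block reader accounts for roughly 49% of the remaining self time and the LPC restore for roughly 20%, which is consistent with those two functions being the ones that show up in a profiler when libFLAC is used directly.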
Re: [flac-dev] About SSE intrinsincs in decoder
On Thu, 7 Jul 2022 at 09:07, olivier tristan wrote:

> Hence even small optimizations are very welcome :)

I presume you use libFLAC directly then. Sadly there is little left to
optimize in the decoder. Below is an excerpt of the output of gprof on
flac decoding a track

>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  s/call   s/call  name
> 34.87      0.68     0.68   680925     0.00     0.00  FLAC__bitreader_read_rice_signed_block
> 25.64      1.18     0.50  6004826     0.00     0.00  FLAC__MD5Transform
> 14.36      1.46     0.28    46030     0.00     0.00  FLAC__lpc_restore_signal
>  8.72      1.63     0.17    23457     0.00     0.00  read_frame_
>  5.13      1.73     0.10    23457     0.00     0.00  write_callback
>  3.08      1.79     0.06    23457     0.00     0.00  FLAC__MD5Accumulate
>  3.08      1.85     0.06                             read
>  2.56      1.90     0.05    50901     0.00     0.00  FLAC__crc16_update_words32
>  1.03      1.92     0.02    23457     0.00     0.00  write_audio_frame_to_client_
>  0.51      1.93     0.01  2016520     0.00     0.00  bitreader_read_from_client_
>  0.51      1.94     0.01                             _IO_file_seekoff
>  0.51      1.95     0.01                             write

As you can see, the bitreader takes up most time. This is however not
something that can be optimized with SIMD/vector instructions like SSE,
AVX, NEON etc. It is also strictly a sequential process. In the past there
have been several attempts at improving speed of this call. You could try
for yourself whether configuring with ./configure --enable-64-bit-words or
cmake -DENABLE_64_BIT_WORDS=ON brings any (small) improvement.

Next the MD5 transformation takes up a lot of time too, but I suppose you
do not use that anyway. It is disabled by default when decoding using
libFLAC directly.

Finally the lpc restore takes up some time and can be improved with SSE,
AVX, NEON etc., but it represents only a small part of the decoding CPU
load.

Perhaps it is possible to add a switch to the encoder to create FLAC files
that are optimized for decoding speed instead of size. Would that be
something you would use? For example, accepting 5% less compression in
exchange for 30% more decoding speed, assuming that MD5 checking is
already off?
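To make the "strictly a sequential process" point concrete, here is a minimal sketch in C of Rice decoding. It is not libFLAC's FLAC__bitreader_read_rice_signed_block; the bit_reader type and the function names are hypothetical, and the layout assumed is the usual one (a run of zero bits ended by a one bit, then k remainder bits, then zigzag folding to a signed value). Each codeword has a variable length, so the position of codeword i+1 is only known after codeword i has been decoded, which is what keeps this loop from being spread across SIMD lanes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSB-first bit reader over a byte buffer. */
typedef struct {
    const uint8_t *data;
    size_t size_bits;      /* number of valid bits in data */
    size_t pos;            /* index of the next bit to read */
} bit_reader;

/* Read one bit; returns 0 when the input is exhausted. */
static int read_bit(bit_reader *br, uint32_t *bit)
{
    if (br->pos >= br->size_bits)
        return 0;
    *bit = (br->data[br->pos >> 3] >> (7 - (br->pos & 7))) & 1u;
    br->pos++;
    return 1;
}

/* Decode one signed Rice-coded value with parameter k. */
static int read_rice_signed(bit_reader *br, unsigned k, int32_t *out)
{
    uint32_t bit, q = 0, r = 0;
    assert(k < 31);
    for (;;) {                           /* unary quotient: count zeros up to the first one */
        if (!read_bit(br, &bit))
            return 0;
        if (bit)
            break;
        q++;
    }
    for (unsigned i = 0; i < k; i++) {   /* k-bit binary remainder */
        if (!read_bit(br, &bit))
            return 0;
        r = (r << 1) | bit;
    }
    uint32_t u = (q << k) | r;
    /* Undo zigzag folding: even -> non-negative, odd -> negative. */
    *out = (int32_t)(u >> 1) ^ -(int32_t)(u & 1);
    return 1;
}

/* Decode up to n values; returns how many were decoded.
 * Value i+1 starts wherever value i ended, so the loop is inherently serial. */
static size_t read_rice_block(bit_reader *br, unsigned k, int32_t *vals, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (!read_rice_signed(br, k, &vals[i]))
            break;
    return i;
}
```

A caller would fill in a bit_reader with the buffer, its length in bits and a start position of 0, then call read_rice_block with the Rice parameter of the partition. Real decoders speed this up by reading whole machine words and counting leading zeros instead of looping bit by bit, which is presumably where the 64-bit-words option helps, but the dependency between consecutive codewords remains.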
Re: [flac-dev] About SSE intrinsincs in decoder
I work on virtual music instruments ( uvi.net ) and we store all the
samples in FLAC format because of its compression level, its lossless
capability and its decoding speed. This is clearly the best in the world
and we love it.

We have a streaming engine tailored to its use. In the case of a piano
library, we open around 10,000 files at loading time and read the first
few hundred ms of each file as a preload, to avoid latency when the user
starts playing.
Then, depending on what the user plays, we could be reading around 400
files at the same time, and even more in our new engine, where we
multithread the audio rendering and could reach up to 2000 files.

Hence even small optimizations are very welcome :)

I don't mind losing the 32-bit x86 optimizations, as we don't have users
on 32-bit Intel platforms.
I wouldn't even mind paying someone for better optimization in NEON, for
example.

Hope this clarifies things.

On Wed, 6 Jul 2022 at 20:36, Martijn van Beurden wrote:

> Olivier,
>
> On a more general note, do you experience the decoding speed of libFLAC
> as a bottleneck in your application? Decoding speed of FLAC is already
> best-in-class among lossless audio codecs, so I actually wasn't
> expecting anyone to object to a (small) decrease in decoding speed.
>
> There have been more changes recently that could (slightly) affect
> decoding speed, but for good reasons. I am aware that the following
> changes affect decoding speed
> - https://github.com/xiph/flac/commit/1bec35e33757fc38261b0acfa3c032e720d2baf0
> - https://github.com/xiph/flac/commit/63ac1c37bebbda5ca61ad5a05a1d8fba2883f629
>
> Comparing current git to release 1.3.4 for my 64-bit x86 machine this
> adds up to about 4% slowdown for both 16-bit and 24-bit audio. When
> using a 32-bit compile, for 16-bit audio this cancels out against the
> speed gain from the removal of optimizations and for 24-bit audio this
> compounds to about 10% speed loss. Numbers may vary depending on the CPU
> specifics.
>
> I didn't think this would be a problem, but please speak up (and if
> possible elaborate on the details) if it is.

--
Olivier Tristan
Research & Development
www.uvi.net
Re: [flac-dev] About SSE intrinsincs in decoder
On Wed, 6 Jul 2022 at 20:36, Martijn van Beurden wrote:

> I am aware that the following changes affect decoding speed
> - https://github.com/xiph/flac/commit/1bec35e33757fc38261b0acfa3c032e720d2baf0
> - https://github.com/xiph/flac/commit/63ac1c37bebbda5ca61ad5a05a1d8fba2883f629

Sorry, small mistake in the last email, the commits are
- https://github.com/xiph/flac/commit/1bec35e33757fc38261b0acfa3c032e720d2baf0
- https://github.com/xiph/flac/commit/1793632ee6988deb933ff7551fee92134622b558
Re: [flac-dev] About SSE intrinsincs in decoder
Olivier,

On a more general note, do you experience the decoding speed of libFLAC as
a bottleneck in your application? Decoding speed of FLAC is already
best-in-class among lossless audio codecs, so I actually wasn't expecting
anyone to object to a (small) decrease in decoding speed.

There have been more changes recently that could (slightly) affect
decoding speed, but for good reasons. I am aware that the following
changes affect decoding speed
- https://github.com/xiph/flac/commit/1bec35e33757fc38261b0acfa3c032e720d2baf0
- https://github.com/xiph/flac/commit/63ac1c37bebbda5ca61ad5a05a1d8fba2883f629

Comparing current git to release 1.3.4 for my 64-bit x86 machine this adds
up to about 4% slowdown for both 16-bit and 24-bit audio. When using a
32-bit compile, for 16-bit audio this cancels out against the speed gain
from the removal of optimizations and for 24-bit audio this compounds to
about 10% speed loss. Numbers may vary depending on the CPU specifics.

I didn't think this would be a problem, but please speak up (and if
possible elaborate on the details) if it is.
Re: [flac-dev] About SSE intrinsincs in decoder
On Tue, 5 Jul 2022 at 09:41, olivier tristan wrote:

> You do not talk about the SSE 4.1 version in your benchmark.
>
> Have you tried this use case?

I compared 4 compiles: one without any changes (so with all variants of
the lpc functions, including the SSE4.1 ones) and three with variants of
plain C code. As both CPUs that were tested had SSE4.1 capability, the
SSE4.1 functions were part of the comparison. So yes, current GCC
outperforms those SSE4.1 intrinsics functions on 16-bit inputs and comes
close on 24-bit inputs.
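For readers who have not looked at the code, the kind of plain C loop being compared against the intrinsics looks roughly like the sketch below. This is not the libFLAC source: the function name lpc_restore, its argument order and the residual indexing are assumptions made for this example. The outer loop is a recurrence (each sample depends on the previous ones), while the inner loop is a short dot product that a modern compiler can unroll and vectorize without help.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rebuild samples from an LPC residual, in place.
 *
 * data[0 .. order-1]   already hold the warm-up samples.
 * residual[0 .. n-order-1] hold the prediction errors for the rest.
 * coeff[j] multiplies the sample j+1 positions back.
 * shift is the quantization shift applied to the prediction.
 *
 * The right shift of a negative 64-bit sum is assumed to be arithmetic,
 * which holds on the usual compilers but is implementation-defined in C. */
static void lpc_restore(const int32_t *residual, size_t n,
                        const int32_t *coeff, unsigned order,
                        int shift, int32_t *data)
{
    assert(order <= n);
    for (size_t i = order; i < n; i++) {
        int64_t sum = 0;
        /* Dot product of the coefficients with the previous samples. */
        for (unsigned j = 0; j < order; j++)
            sum += (int64_t)coeff[j] * data[i - 1 - j];
        data[i] = residual[i - order] + (int32_t)(sum >> shift);
    }
}
```

Accumulating in 64 bits keeps the sketch safe for 24-bit input at the cost of some speed; narrower accumulators are one reason 16-bit input can be handled faster than 24-bit input.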
Re: [flac-dev] About SSE intrinsincs in decoder
You do not talk about the SSE 4.1 version in your benchmark.

Have you tried this use case?

Thanks!

On Mon, 4 Jul 2022 at 19:23, Martijn van Beurden wrote:

> On Mon, 4 Jul 2022 at 15:06, olivier tristan wrote:
>
> > While I can understand the rationale for dropping the manual assembly,
> > as 32-bit x86 is dead, it seems a much bigger deal to remove all
> > optimizations, including the intrinsics ones.
>
> Yes, it does seem a great deal to remove all optimization, but it really
> isn't. See the pull request associated with that change for more
> information: https://github.com/xiph/flac/pull/347
>
> I did quite a bit of testing before merging this change, on two
> different CPUs, each with 3 different compilers, each with 4 variants of
> the non-intrinsics-accelerated functions. It turns out that there is no
> performance loss at all, and in many cases this change makes flac
> actually faster, not slower as one would expect.
>
> > Maybe there should be an opt-in if you don't want them to be included
> > by default, but some people, including me, don't want to see those
> > optimizations being removed?
>
> There would be no advantage of that over keeping the original code: it
> still needs to be maintained and tested, even if it is hidden behind
> some configuration option.
>
> The only case where this patch could be problematic in terms of speed is
> when one compiles flac to be used on CPUs that do not support SSE2.

--
Olivier Tristan
Research & Development
www.uvi.net
Re: [flac-dev] About SSE intrinsincs in decoder
On Mon, 4 Jul 2022 at 15:06, olivier tristan wrote:

> While I can understand the rationale for dropping the manual assembly,
> as 32-bit x86 is dead, it seems a much bigger deal to remove all
> optimizations, including the intrinsics ones.

Yes, it does seem a great deal to remove all optimization, but it really
isn't. See the pull request associated with that change for more
information: https://github.com/xiph/flac/pull/347

I did quite a bit of testing before merging this change, on two different
CPUs, each with 3 different compilers, each with 4 variants of the
non-intrinsics-accelerated functions. It turns out that there is no
performance loss at all, and in many cases this change makes flac actually
faster, not slower as one would expect.

> Maybe there should be an opt-in if you don't want them to be included by
> default, but some people, including me, don't want to see those
> optimizations being removed?

There would be no advantage of that over keeping the original code: it
still needs to be maintained and tested, even if it is hidden behind some
configuration option.

The only case where this patch could be problematic in terms of speed is
when one compiles flac to be used on CPUs that do not support SSE2.