Re: [xz-devel] [PATCH] xz: Avoid warnings due to memlimit if threads are in auto mode.

2024-02-28 Thread Sebastian Andrzej Siewior
On 2024-02-28 13:00:08 [+0200], Lasse Collin wrote:
> > > There are also messages that are shown when memory limit does affect
> > > compressed output (switching to single-threaded mode and LZMA2
> > > dictionary size adjustment). The verbosity requirement of these
> > > messages isn't being changed now.  
> > 
> > This sounds like you accept this change in principle but are thinking
> > if V_VERBOSE or V_DEBUG is the right thing.
> 
> Me and three other people on IRC think it should be changed but there
> is no consensus yet what exactly is the best (your patch, -v, or -vv).
> This is about the thread count messages only as (since 5.4.0) automatic
> thread count doesn't affect the compressed output.
> 
> There is some discussion also here:
> 
> https://github.com/tukaani-project/xz/issues/89

I see. In that case let me throw this to V_DEBUG Debian wise and sync
with xz upstream once a new release is up or so. I have two packages
that fail because of this and dpkg added a workaround. So instead adding
another workaround to another package I would fix this on the xz side.
Sounds good?

Sebastian



Re: [xz-devel] [PATCH] xz: Avoid warnings due to memlimit if threads are in auto mode.

2024-02-27 Thread Sebastian Andrzej Siewior
On 2024-02-27 19:17:48 [+0200], Lasse Collin wrote:
> Thanks for the patch! We discussed a bit on IRC and everyone thinks
> it's on the right track but we are pondering the implementation details
> still.
> 
> The thread count messages are shown in situations which don't affect
> the compressed output, and thus the importance of these messages isn't
> so high. Originally they were there to reduce the chance of people
> asking why xz isn't using as many threads as requested.

Understood.

> We are considering to simply change those two message() calls to always
> use V_VERBOSE or V_DEBUG instead of the current V_WARNING. So automatic
> vs. manual number of threads wouldn't affect it like it does in your
> patch. Comparing your apporach and this simpler one:
> 
>   + There are scripts that take a user-specified number for
> parallelization and that number is passed to multiple tools, not
> just xz. Keeping xz -T16 silent about thread count reduction can
> make sense in this case.
> 
>   - The silencing could be done with -q as well though.

Wouldn't -q also shut some legitime warnings?

> There are pros and cons between V_VERBOSE and V_DEBUG.
> 
> For (de)compression, a single -v sets V_VERBOSE and actives the
> progress indicator. If the thread count messages are shown at -v, on
> some systems progress indicator usage would get the message about
> reduced thread count as well.
> 
>   + It works as a hint that increasing the memory usage limits manually
> might allow more threads to be used.
> 
>   - If one uses progress indicator frequently, the thread count
> reduction message might become slightly annoying as the information
> is already known by the user.
> 
>   - Progress indicator can be used in non-interactive cases (when
> stderr isn't a terminal). Then xz only prints a final summary per
> file. This likely is not a common use case but the thread count
> messages would be here as well.
> 
> V_DEBUG is set when -v is used twice (-vv).
> 
>   + Regular progress indicator uses wouldn't get extra messages.
> 
>   - A larger number of users might not become aware that they aren't
> getting as many threads as they could because the automatic memory
> usage limit is too low to allow more threads.

Isn't the automatic memory usage accurate? Because then there is hardly
something you could do about it. Maybe kill chrome…
On the other hand if they decompress and they have 32s CPU and xz
reduces it to 16 threads then there is no "loss" if the file has 16
blocks or less :)
Not sure if documenting it in the man-page would help here.

> There are also messages that are shown when memory limit does affect
> compressed output (switching to single-threaded mode and LZMA2
> dictionary size adjustment). The verbosity requirement of these messages
> isn't being changed now.

This sounds like you accept this change in principle but are thinking if
V_VERBOSE or V_DEBUG is the right thing.

Sebastian



[xz-devel] [PATCH] xz: Avoid warnings due to memlimit if threads are in auto mode.

2024-02-26 Thread Sebastian Andrzej Siewior
From: Sebastian Andrzej Siewior 

If threads are automatically selected then it is possible that their
number needs to get reduced in order not to exceed the current memory
limit. This "reducing" forces a warning which is printed on stderr to
inform the user.
The information is probably not something the user would be interested
in since he did not explicitly ask for the additional threads and so any
number of threads would probably do it without raising an eyebrow.
The downside of this warning is that a few testsuites capture the output
of stderr and complain now that something went wrong.

Print the warning about reduced threads only if number is selected
- automatically and asked to be verbose (-v)
- explicit by the user

Signed-off-by: Sebastian Andrzej Siewior 
---
 src/xz/coder.c| 13 +++--
 src/xz/hardware.c |  7 +++
 src/xz/hardware.h |  2 ++
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/src/xz/coder.c b/src/xz/coder.c
index 4efaa802b9bbc..e5e30558aedf6 100644
--- a/src/xz/coder.c
+++ b/src/xz/coder.c
@@ -580,8 +580,13 @@ coder_set_compression_settings(void)
message_bug();
 
if (memory_usage <= memory_limit) {
+   enum message_verbosity v = V_WARNING;
+
+   if (hardware_threads_are_automatic())
+   v = V_VERBOSE;
+
// The memory usage is now low enough.
-   message(V_WARNING, _("Reduced the number of "
+   message(v, _("Reduced the number of "
"threads from %s to %s to not exceed "
"the memory usage limit of %s MiB"),
uint64_to_str(
@@ -601,7 +606,11 @@ coder_set_compression_settings(void)
// time the soft limit will never make xz fail and never make
// xz change settings that would affect the compressed output.
if (hardware_memlimit_mtenc_is_default()) {
-   message(V_WARNING, _("Reduced the number of threads "
+   enum message_verbosity v = V_WARNING;
+
+   if (hardware_threads_are_automatic())
+   v = V_VERBOSE;
+   message(v, _("Reduced the number of threads "
"from %s to one. The automatic memory usage "
"limit of %s MiB is still being exceeded. "
"%s MiB of memory is required. "
diff --git a/src/xz/hardware.c b/src/xz/hardware.c
index 952652fecb8d9..c1d54a5910b7a 100644
--- a/src/xz/hardware.c
+++ b/src/xz/hardware.c
@@ -195,6 +195,13 @@ hardware_memlimit_mtenc_get(void)
 }
 
 
+extern bool
+hardware_threads_are_automatic(void)
+{
+   return threads_are_automatic;
+}
+
+
 extern bool
 hardware_memlimit_mtenc_is_default(void)
 {
diff --git a/src/xz/hardware.h b/src/xz/hardware.h
index 25b351e32b195..e4cfe299d2b2d 100644
--- a/src/xz/hardware.h
+++ b/src/xz/hardware.h
@@ -25,6 +25,8 @@ extern uint32_t hardware_threads_get(void);
 /// This can be true even if the number of threads is one.
 extern bool hardware_threads_is_mt(void);
 
+/// Returns true if the number of threads has set automaticaly.
+extern bool hardware_threads_are_automatic(void);
 
 /// Set the memory usage limit. There are separate limits for compression,
 /// decompression (also includes --list), and multithreaded decompression.
-- 
2.43.0




Re: [xz-devel] Testing LZMA_RANGE_DECODER_CONFIG

2024-02-20 Thread Sebastian Andrzej Siewior

On 2024-02-18 22:35:03 [+0200], Lasse Collin wrote:
> The balance between the hottest locations in the decompressor code
> varies depending on the input file. Linux kernel source compresses very
> well (ratio is about 0.10). This reduces the benefit of branchless
> code. On my main computer I still get about 2 % time reduction with =3.

Okay, so the input matters, too. I tried 1GiB urandom (so it does not
compress so well) but that went quicker than expected… Anyway.
I found 3 idle x86 boxes and re-run a test with linux' perf on them and
the arm64 box. I all flavours for the two archives. On RiscV I did the
'xz -t' thing because perf seems not to be supported well or I lack
access.

The task is pinned to a single CPU means the task can't be migrated to
another core and xz observes only one "core" (and does not spawn
threads). So it is single threaded.

Intel(R) Xeon(R) Platinum 8176M CPU:

|  Performance counter stats for './xz_0x000_gcc -t linux-6.7.5.tar.xz' (5 
runs):
|  
|  13.384,81 msec task-clock   #1,000 CPUs 
utilized   ( +-  0,05% )
| 21  context-switches #1,569 /sec  
  ( +-  2,61% )
|  0  cpu-migrations   #0,000 /sec  

|119  page-faults  #8,891 /sec  
  ( +-  0,34% )
| 28.041.975.275  cycles   #2,095 GHz   
  ( +-  0,05% )
| 32.576.330.155  instructions #1,16  insn per 
cycle  ( +-  0,00% )
|  4.304.914.251  branches #  321,627 M/sec 
  ( +-  0,00% )
|567.850.712  branch-misses#   13,19% of all 
branches ( +-  0,02% )
| 
|   13,38558 +- 0,00707 seconds time elapsed  ( +-  0,05% )
|
|  Performance counter stats for './xz_0x003_gcc -t linux-6.7.5.tar.xz' (5 
runs):
| 
|  12.853,67 msec task-clock   #1,000 CPUs 
utilized   ( +-  0,03% )
| 18  context-switches #1,400 /sec  
  ( +-  5,72% )
|  0  cpu-migrations   #0,000 /sec
|220  page-faults  #   17,116 /sec  
  ( +- 45,95% )
| 26.929.223.135  cycles   #2,095 GHz   
  ( +-  0,03% )
| 42.017.609.529  instructions #1,56  insn per 
cycle  ( +-  0,00% )
|  3.226.245.101  branches #  250,998 M/sec 
  ( +-  0,00% )
|299.814.626  branch-misses#9,29% of all 
branches ( +-  0,11% )
| 
|   12,85438 +- 0,00395 seconds time elapsed  ( +-  0,03% )

missed branches dropped, gained instructions but isn per cycle improved.
Less idle cycles. Worth, ~0.5 sec.

|  Performance counter stats for './xz_0x00f_gcc -t linux-6.7.5.tar.xz' (5 
runs):
| 
|  12.872,36 msec task-clock   #1,000 CPUs 
utilized   ( +-  0,01% )
| 17  context-switches #1,321 /sec  
  ( +-  6,55% )
|  0  cpu-migrations   #0,000 /sec
|220  page-faults  #   17,091 /sec  
  ( +- 45,98% )
| 26.968.386.196  cycles   #2,095 GHz   
  ( +-  0,01% )
| 44.566.213.262  instructions #1,65  insn per 
cycle  ( +-  0,00% )
|  2.957.642.049  branches #  229,767 M/sec 
  ( +-  0,00% )
|249.987.257  branch-misses#8,45% of all 
branches ( +-  0,05% )
| 
|   12,87303 +- 0,00115 seconds time elapsed  ( +-  0,01% )

Slightly worse vs previous.

|  Performance counter stats for './xz_0x1f0_gcc -t linux-6.7.5.tar.xz' (5 
runs):
| 
|   9.740,84 msec task-clock   #1,000 CPUs 
utilized   ( +-  0,02% )
| 21  context-switches #2,156 /sec  
  ( +-  6,14% )
|  0  cpu-migrations   #0,000 /sec
|216  page-faults  #   22,175 /sec  
  ( +- 46,95% )
| 20.407.560.821  cycles   #2,095 GHz   
  ( +-  0,02% )
| 34.751.763.859  instructions #1,70  insn per 
cycle  ( +-  0,00% )
|  3.182.093.181  branches #  326,676 M/sec 
  ( +-  0,00% )
|

[xz-devel] Testing LZMA_RANGE_DECODER_CONFIG

2024-02-17 Thread Sebastian Andrzej Siewior
Hi,

I did some testing on !x86. I changed LZMA_RANGE_DECODER_CONFIG to
different values run a test and looked at the MiB/s value. xz_0 means
LZMA_RANGE_DECODER_CONFIG was 0, xz_1 means the define was set to 1. I
touched src/liblzma/lzma/lzma_decoder.c and rebuilt xz. I pinned the
shell to a single CPU and run test for archive (-tv) for one file three
times. This are the results:

arm64 (Lenovo HR350A):
=== xz 5.4.1 ===
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   110 MiB/s   0:12
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   110 MiB/s   0:12
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   110 MiB/s   0:12

=== ./xz_0 ===
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   115 MiB/s   0:12
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   115 MiB/s   0:12
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   115 MiB/s   0:12

=== ./xz_1 ===
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   108 MiB/s   0:12
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   108 MiB/s   0:12
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   108 MiB/s   0:12

=== ./xz_3 ===
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   109 MiB/s   0:12
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   109 MiB/s   0:12
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   109 MiB/s   0:12

=== ./xz_7 ===
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   109 MiB/s   0:12
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   109 MiB/s   0:12
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   109 MiB/s   0:12

=== ./xz_f ===
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   107 MiB/s   0:12
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   107 MiB/s   0:12
linux-6.7.5.tar.xz (1/1)
  100 % 134.9 MiB / 1,386.4 MiB = 0.097   107 MiB/s   0:12



RiscV (HiFive Unmatched)
=== xz 5.4.5 ===
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09730 MiB/s   0:45
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09730 MiB/s   0:46
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09730 MiB/s   0:45

=== ./xz_0 ===
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09732 MiB/s   0:43
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09732 MiB/s   0:43
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09732 MiB/s   0:43

=== ./xz_1 ===
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09731 MiB/s   0:44
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09731 MiB/s   0:44
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09731 MiB/s   0:44

=== ./xz_3 ===
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09730 MiB/s   0:45
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09731 MiB/s   0:45
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09731 MiB/s   0:45

=== ./xz_7 ===
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09731 MiB/s   0:45
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09731 MiB/s   0:44
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09731 MiB/s   0:44

=== ./xz_f ===
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09730 MiB/s   0:46
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09730 MiB/s   0:45
linux-6.7.5.tar.xz (1/1)
  100 % 134,9 MiB / 1.386,4 MiB = 0,09730 MiB/s   0:45


Based on this it looks like the `0' variant is the best one. Is my test
too simple and does not cover "everything / wide range of decodings"?

Sebastian



Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64

2022-11-23 Thread Sebastian Andrzej Siewior
On 2022-11-23 21:12:53 [+0100], To Lasse Collin wrote:
> > It is fine to build *static* liblzma with --disable-symbol-versions on
> > all archs. Debian-specific workaround is fine in the short term but
> > this should be fixed upstream. One method would be to disable the extra
> > symbols on ia64 but that is not a real fix. Perhaps it's not really
> > possible as long as the main build system is Autotools, I don't
> > currently know.
> 
> I'm not sure what other do but it might be reasonable to disable symbol
> versions for static linking/ compile since there should be no need for
> them.
> I kicked a mariadb build on amd64 with liblzma-dev as an addititional
> dependency just to see if it fails.

Just for the protocol: The mariadb build on amd64 with liblzma-dev
installed passed. So this was not it…

Sebastian



Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64

2022-11-23 Thread Sebastian Andrzej Siewior
On 2022-11-23 22:26:39 [+0200], Lasse Collin wrote:
> On 2022-11-23 John Paul Adrian Glaubitz wrote:
> > Well, Debian builds both the static and dynamic libraries in separate
> > steps, so I'm not sure whether the autotools build system would be
> > able to detect that.
> 
> I would assume the separate steps means running configure twice, once
> to disable static build and once to disable shared build.

3x to be exact:
- 1x shared with threads
- 1x static with threads
- 1x non-shared, no threads, no encoders, just xzdec.

There are three build folder in the end. The full gets a make install,
the other get xzdec/liblzma.a extracted.

Sebastian



Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64

2022-11-23 Thread Sebastian Andrzej Siewior
On 2022-11-23 14:33:39 [+0100], John Paul Adrian Glaubitz wrote:
> @Sebastian: Can you do that? Does anything speak against that?

No, let me do that.

> Adrian

Sebastian



Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64

2022-11-22 Thread Sebastian Andrzej Siewior
On 2022-11-22 18:51:49 [+0100], John Paul Adrian Glaubitz wrote:
> Hello!
Hi,

> [ 36%] Linking CXX shared module ha_archive.so
> cd /<>/builddir/storage/archive && /usr/bin/cmake -E 
> cmake_link_script CMakeFiles/archive.dir/link.txt --verbose=1
> /usr/bin/c++ -fPIC -g -O2 -ffile-prefix-map=/<>=. 
> -specs=/usr/share/dpkg/pie-compile.specs -Wformat -Werror=format-security 
> -Wdate-time -D_FORTIFY_SOURCE=2 -Wdate-time -D_FORTIFY_SOURCE=2 -pie -fPIC 
> -fstack-protector --param=ssp-buffer-size=4 -O2 -g -static-libgcc 
> -fno-omit-frame-pointer -fno-strict-aliasing -Wno-uninitialized 
> -fno-omit-frame-pointer -D_FORTIFY_SOURCE=2 -DDBUG_OFF -Wall -Wenum-compare 
> -Wenum-conversion -Wextra -Wformat-security -Wmissing-braces 
> -Wno-format-truncation -Wno-init-self -Wno-nonnull-compare 
> -Wno-unused-parameter -Woverloaded-virtual -Wnon-virtual-dtor -Wvla 
> -Wwrite-strings -specs=/usr/share/dpkg/pie-link.specs -Wl,-z,relro,-z,now 
> -shared  -o ha_archive.so CMakeFiles/archive.dir/azio.c.o 
> CMakeFiles/archive.dir/ha_archive.cc.o  ../../libservices/libmysqlservices.a 
> -lz
> /usr/bin/ld: warning: -z relro ignored
> /usr/bin/ld: ha_archive.so: version node not found for symbol 
> lzma_get_progress@@XZ_5.2
> /usr/bin/ld: failed to set dynamic section sizes: bad value
> collect2: error: ld returned 1 exit status
> make[4]: *** [storage/archive/CMakeFiles/archive.dir/build.make:118: 
> storage/archive/ha_archive.so] Error 1
> make[4]: Leaving directory '/<>/builddir'
> make[3]: *** [CMakeFiles/Makefile2:4913: 
> storage/archive/CMakeFiles/archive.dir/all] Error 2

I'm not sure if this an ia64 issue or something else is missing. Looking
at the symbols:

| bigeasy@yttrium:~$ readelf -W --dyn-syms 
/lib/ia64-linux-gnu/liblzma.so.5|grep lzma_get_progress
|160: 7480   208 FUNCGLOBAL DEFAULT   12 
lzma_get_progress@@XZ_5.2
|161: 7480   208 FUNCGLOBAL DEFAULT   12 
lzma_get_progress@XZ_5.2.2
| bigeasy@yttrium:~$ readelf -W --dyn-syms /usr/bin/xz|grep progress
| 45:  0 FUNCGLOBAL DEFAULT  UND 
lzma_get_progress@XZ_5.2 (8)

The @@ thingy is used in the library to mark the default symbol. So
liblzma provides two lzma_get_progress and default is XZ_5.2. The XZ
binary picked it up properly. Looking around in your build:

| bigeasy@yttrium:~$ readelf -W --dyn-syms 
../glaubitz/mariadb-10.6/mariadb-10.6-10.6.11/builddir/client/mariadb-conv 
|grep lzma_get_progress
|812: 0011c140   208 FUNCGLOBAL DEFAULT   14 
lzma_get_progress@@XZ_5.2
|813: 0011c140   208 FUNCGLOBAL DEFAULT   14 
lzma_get_progress@XZ_5.2.2

This looks like it is staticaly linked against liblzma. I didn't find
lzma_get_progress anywhere else. So it looks like this function isn't
used by mariadb itself but appears due to static linking somewhere and
asks for trouble. I didn't find any reference to lzma_get_progress in
/lib/ia64-linux-gnu/libgcc_s.so.1, /lib/ia64-linux-gnu/libz.so.1.2.13,
ha_archive.cc.o or libmysqlservices.a. This seems to be all that is
passed to the compiler for linking.

Sebastian



Re: [xz-devel] XZ Utils 5.3.3alpha

2022-10-03 Thread Sebastian Andrzej Siewior
On 2022-09-30 20:23:06 [+0300], Lasse Collin wrote:
> On 2022-09-29 Guillem Jover wrote:
> > On Wed, 2022-09-28 at 21:41:59 +0800, Jia Tan wrote:
> > > […] The
> > > interface for liblzma and xz for the multi threaded decoder does not
> > > have any planned changes, so things could probably be developed and
> > > tested using 5.3.3.  
> > 
> > Ah, thanks, that's reassuring then. It's one of the things I was
> > worried about when having to decide whether to merge the patch I've
> > got implementing this support into dpkg. So, once the alpha version
> > has been packaged for Debian experimental, I'll test the patch and
> > commit it.
> 
> There are no planned changes but that isn't a *promise* that there won't
> be any changes before 5.4.0.
>
> I don't track API or ABI compatibility within development releases and
> thus binaries linked against shared liblzma from one alpha/beta release
> won't run with liblzma from the next alpha/beta *if* they depend on
> unstable symbols (symbol versioning stops it). This includes the xz
> binary itself and would include dpkg too if it uses the threaded
> decoder.

That should be no problem. The last alpha version has been uploaded to
debian experimental and it is exactly the place for such things. So dpkg
could be linked against that version in experimental and will never
enter an official release.

Sebastian



Re: [xz-devel] XZ Utils 5.3.3alpha

2022-09-27 Thread Sebastian Andrzej Siewior
On 2022-09-27 21:29:07 [+0800], Jia Tan wrote:
> > Are there any open issues? If not, what needs to be done before the
> > final release can happen?
> 
> The 5.4.0 release that will contain the multi threaded decoder is
> planned for December. The list of open issues related to 5..4.0 in
> general that I am tracking are:
> 
> - Final tweaks to multi threaded decoder (error handling may need
> improvements since the worker threads stay running in some cases when
> they should not).

Okay, so that is what you are tracking. I remember that there was a
stall in the decoding but I don't remember how it played out.

I do remember that I had something for memory allocation/ limit but I
don't remember if we settled on something or if discussion is needed.
Also how many decoding threads make sense, etc.

> - New ARM64 filter needs to be properly coordinated to other xz
> implementations and documented.
> - Converting tests to the new tuktest framework. Most of the tests
> have been written, but they still need to be reviewed.
> - liblzma and xz functionality to convert a string into a filter
> chain. A draft of this is on the mailing list already, but the syntax
> needs finalizing and the code was not polished.
> - A patch for .lz support needs review.
> - A patch for crc64 optimizations needs review.

This reminds me that I once posted a patch to use openssl for the
sha256. 
https://www.mail-archive.com/xz-devel@tukaani.org/msg00429.html

Some distro is using sha256 instead crc64 by default, I don't remember
which one… Not that I care personally ;)

> - Misc. minor bug fixes.
>
> This is everything currently planned. Most things are done and just
> needs review and minor improvements. Don't worry, multi threaded
> decompression will be coming to xz in a stable release very soon!

Okay. That is good to hear. I would like to get it in Debian and have
dpkg support for the upcomming stable release. The earlier the better
since this affects quite a large part of the system. The toolchain
freeze is in January and I think that dpkg is part of it (or people will
probably get very nervous if such a change gets integrated later in the
cycle).

> Jia Tan

Sebastian



Re: [xz-devel] XZ Utils 5.3.3alpha

2022-09-26 Thread Sebastian Andrzej Siewior
On 2022-08-22 19:52:09 [+0300], Lasse Collin wrote:
> XZ Utils 5.3.3alpha is available at . Here is an
> extract from the NEWS file:
> 
> - Added threaded .xz decompressor lzma_stream_decoder_mt().
>   It can use multiple threads with .xz files that have multiple
>   Blocks with size information in Block Headers. The threaded
>   encoder in xz has always created such files.

Are there any open issues? If not, what needs to be done before the
final release can happen?

Sebasttian



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-03-11 Thread Sebastian Andrzej Siewior
On 2022-03-11 21:59:19 [+0800], Jia Tan wrote:
> > This:
> >
> > diff --git a/src/liblzma/common/stream_decoder_mt.c 
> > b/src/liblzma/common/stream_decoder_mt.c
> > --- a/src/liblzma/common/stream_decoder_mt.c
> > +++ b/src/liblzma/common/stream_decoder_mt.c
> > @@ -786,7 +786,7 @@ read_output_and_wait(struct lzma_stream_coder *coder,
> > if (mythread_cond_timedwait(>cond,
> > >mutex,
> > wait_abs) != 0) {
> > -   ret = LZMA_TIMED_OUT;
> > +   ret = LZMA_OK;
> > break;
> > }
> > } else {
> >
> > Should "fixes" it. At some point the main thread needs to check if the
> > next thread is able to make progress or not and then return LZMA_OK so
> > that the upper layer can figure out that no progress is made. Otherwise
> > it stucks in the LZMA_TIMED_OUT loop.
> 
> This fixes it just for xz, but if no timeout is specified it will
> still deadlock.
> I haven't looked at the code enough to understand how to fix it for both,
> but I will start to look into that.

No, it is for everyone. In the timeout case we need to check if the
first thread in line can make progress. If we don't provide data new
data to the thread and the thread consumed everything it had then the
thread won't make progress. If the function gets invoked with 0 new data
then we should return LZMA_OK at which point the upper layer (between XZ
binary and the library) will notice that no progress is made and return
an error to xz.
The above patch is not correct because if you do it as I suggeted then
it is possible that an error is returned because the thread is slow and
did not yet make progress.

> I was following the conversation about the soft and hard memory limiting.
> If a user wanted decoding to fail if it can't be done multithreaded and update
> the memory limit as needed, that can't be done right now. It's a minor issue
> that only matters for liblzma, but it would be nice to be able to update both
> limits after decoding has started. I don't consider this a bug, more like
> missing a nice to have feature. One easy solution is to add a new API
> function to liblzma to update the soft memory limit for the multithreaded
> decoder and do nothing / return an error on all other coders. I will add
> a patch for this if you guys think it is a good idea.

Maybe I missundertand you. But if you set memlimit_stop to the same
value as memlimit_threading then you have either multi threaded
decompression or none at all right?

> Jia Tan

Sebastian



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-03-10 Thread Sebastian Andrzej Siewior
On 2022-03-10 22:16:27 [+0800], Jia Tan wrote:
> I started writing tests in the new framework and I found one bug and
> one issue. If you want to check out the tests I have so far, here is a
> link to check out my progress:
> https://github.com/JiaT75/XZ_Utils_Unofficial/tree/test_multithreaded_decoder
> 
> The bug is with truncated xz files. In multithreaded mode, if a file
> has been corrupted and is missing the end, deadlock occurs. An easy
> way to recreate this is by using the truncate command:
> truncate -s 3 some_multiblock_file.xz
> 
> And then:
> xz -dk --verbose some_multiblock_file.xz --threads=2
> 
> This will result in a deadlock in multithreaded decoding, but not an
> error in single threaded decoding.

This:

diff --git a/src/liblzma/common/stream_decoder_mt.c 
b/src/liblzma/common/stream_decoder_mt.c
--- a/src/liblzma/common/stream_decoder_mt.c
+++ b/src/liblzma/common/stream_decoder_mt.c
@@ -786,7 +786,7 @@ read_output_and_wait(struct lzma_stream_coder *coder,
if (mythread_cond_timedwait(>cond,
>mutex,
wait_abs) != 0) {
-   ret = LZMA_TIMED_OUT;
+   ret = LZMA_OK;
break;
}
} else {

Should "fixes" it. At some point the main thread needs to check if the
next thread is able to make progress or not and then return LZMA_OK so
that the upper layer can figure out that no progress is made. Otherwise
it stucks in the LZMA_TIMED_OUT loop.

> The issue is with updating the memlimit with lzma_memlimit_set. As you
> note in your comment in stream_decoder_mt_memconfig there is no way to
> update memlimit_threading. If the memlimit_stop is set very low to
> start, it will force memlimit_threading to be that same small value. I
> could see users wanting to keep memlimit_threading and memlimit_stop
> in sync or have memlimit_threading always be some function of
> memlimit_stop (maybe memlimit_stop / 2 or something). I am not sure
> what the best fix is for this at the moment, but I don't think having
> only one of these values be configurable at runtime is the best idea.
> Especially when the initialization forces memlimit_threading to be at
> most memlimit_stop (which makes sense for almost every situation).

The idea to have on limit to keep things going (no matter what) and the
other to have reasonable limit at which point you don't want threads to
be used.

> I will continue to write more tests and then review the code itself.
> Nice job to both of you for getting this feature as far as it is
> already.
> 
> Jia Tan

Sebastian



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-03-07 Thread Sebastian Andrzej Siewior
On 2022-03-07 01:08:40 [+0200], Lasse Collin wrote:
> Hello!
Hi,

> I committed something. The liblzma part shouldn't need any big changes,
> I hope. There are a few FIXMEs but some of them might actually be fine
> as is. The xz side is just an initial commit, there isn't even
> --memlimit-threading option yet (I will add it).
> 
> Testing is welcome. It would be nice if someone who has 12-24 hardware
> threads could test if it scales well. One needs a file with like a
> hundred blocks, so with the default xz -6 that means a 2.5 gigabyte
> uncompressed file, smaller if one uses, for example, --block-size=8MiB
> when compressing.

I made
StreamBlocks  CompOffsetUncompOffsetCompSize  
UncompSize  Ratio  Check  Padding
 1   777   0   0   2.386.777.028  
19.540.326.400  0,122  CRC640

one block is 25.165.824.

32 cores:

| $ time ./src/xz/xz -tv tars.tar.xz -T0
| tars.tar.xz (1/1)
|   100 %  2.276,2 MiB / 18,2 GiB = 0,122   1,6 GiB/s   0:11
 
| 
| real0m11,162s
| user5m44,108s
| sys 0m1,988s

256 cores:
| $ time ./src/xz/xz -tv tars.tar.xz -T0
| tars.tar.xz (1/1)
|   100 %  2.276,2 MiB / 18,2 GiB = 0,122   3,4 GiB/s   0:05
 
| 
| real0m5,403s
| user4m0,298s
| sys 0m24,315s

it appears to work :) If I see this right, then the file is too small or
xz too fast but it does not appear that xz manages to create more than
100 threads.

and decompression to disk
| $ time ~bigeasy/xz/src/xz/xz -dvk tars.tar.xz -T0
| tars.tar.xz (1/1)
|   100 %  2.276,2 MiB / 18,2 GiB = 0,122   746 MiB/s   0:24
 
| 
| real0m25,064s
| user3m49,175s
| sys 0m29,748s

appears to block at around 10 to 14 threads or so and then it hangs at the end
until disk I/O finishes. Decent.
Assuming disk I/O is slow, say 10MiB/s, and we would 388 CPUs (blocks/2)
then it would decompress the whole file into memory and stuck on disk
I/O?

In terms of scaling, xz -tv of that same file with with -T1…64:

| CPUS: 1
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 85 MiB/s, 3:38
| 
| real  3m38,047s
| user  3m37,404s
| sys   0m0,626s
| CPUS: 2
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 171 MiB/s, 1:49
| 
| real  1m49,296s
| user  3m41,529s
| sys   0m1,433s
| CPUS: 3
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 256 MiB/s, 1:12
| 
| real  1m12,832s
| user  3m40,929s
| sys   0m1,199s
| CPUS: 4
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 341 MiB/s, 0:54
| 
| real  0m54,616s
| user  3m40,596s
| sys   0m1,161s
| CPUS: 5
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 425 MiB/s, 0:43
| 
| real  0m43,900s
| user  3m41,306s
| sys   0m1,038s
| CPUS: 6
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 510 MiB/s, 0:36
| 
| real  0m36,587s
| user  3m41,527s
| sys   0m1,076s
| CPUS: 7
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 591 MiB/s, 0:31
| 
| real  0m31,568s
| user  3m41,559s
| sys   0m1,079s
| CPUS: 8
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 676 MiB/s, 0:27
| 
| real  0m27,579s
| user  3m42,098s
| sys   0m0,966s
| CPUS: 9
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 758 MiB/s, 0:24
| 
| real  0m24,614s
| user  3m42,318s
| sys   0m1,119s
| CPUS: 10
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 844 MiB/s, 0:22
| 
| real  0m22,111s
| user  3m41,353s
| sys   0m1,152s
| CPUS: 11
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 923 MiB/s, 0:20
| 
| real  0m20,219s
| user  3m43,327s
| sys   0m1,311s
| CPUS: 12
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,0 GiB/s, 0:18
| 
| real  0m18,442s
| user  3m41,710s
| sys   0m1,110s
| CPUS: 13
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,1 GiB/s, 0:17
| 
| real  0m17,067s
| user  3m42,102s
| sys   0m1,176s
| CPUS: 14
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,1 GiB/s, 0:15
| 
| real  0m15,861s
| user  3m41,978s
| sys   0m1,171s
| CPUS: 15
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,2 GiB/s, 0:14
| 
| real  0m14,866s
| user  3m42,247s
| sys   0m1,108s
| CPUS: 16
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,3 GiB/s, 0:13
| 
| real  0m13,936s
| user  3m41,086s
| sys   0m1,017s
| CPUS: 17
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,4 GiB/s, 0:13
| 
| real  0m13,200s
| user  3m42,171s
| sys   0m1,137s
| CPUS: 18
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,5 GiB/s, 0:12
| 
| real  0m12,539s
| user  3m43,286s
| sys   0m1,355s
| CPUS: 19
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,5 GiB/s, 0:11
| 
| real  0m11,949s
| user  3m44,354s
| sys   0m1,111s
| CPUS: 20
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,6 GiB/s, 0:11
| 
| real  0m11,216s
| user  3m42,635s
| sys   0m1,202s
| CPUS: 21
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,7 GiB/s, 0:10
| 
| real  0m10,655s
| user  3m41,742s
| sys   0m1,123s
| CPUS: 22
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,8 GiB/s, 0:10
| 
| real  0m10,232s
| user  3m42,328s
| sys   0m1,211s
| CPUS: 23
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,9 GiB/s, 0:09
| 
| real  

Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-02-07 Thread Sebastian Andrzej Siewior
On 2022-02-07 01:46:32 [+0200], Lasse Collin wrote:
> I now plan to use memlimit_threading and memlimit_stop in the lzma_mt
> structure. Documentation is still needed but hopefully those are a bit
> more obvious.

oki

…

[ long update ]

> I added support for lzma_get_progress(). It is needed to make xz -v
> progress indicator useful.

Thank you for the update.

Sebastian



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2021-12-31 Thread Sebastian Andrzej Siewior
On 2021-12-15 23:33:58 [+0200], Lasse Collin wrote:
> Yes. It's fairly simple from implementation point of view but is it
> clear enough for the users, I'm not sure.
> 
> I suppose the alternative is having just one limit value and a flag to
> tell if it is a soft limit (so no limit for single-threaded case) or a
> hard limit (return LZMA_MEM_ERROR if too low for even single thread).
> Having separate soft and hard limits instead can achieve the same and a
> little more, so I think I'll choose the two-value approach and hope it's
> clear enough for users.

The value approach might work. I'm not sure if the term `soft' and
`hard' are good here. Using `memlimit' and `memlimit_threaded' (or so)
might make more obvious and easier to understand.
But then this just some documentation that needs to be read and
understood so maybe `softlimit' and `hardlimit' will work just fine.

> I was hoping to get this finished by Christmas but due to a recent sad
> event, late January is my target for the next alpha release now. I hope
> to include a few other things too, including some of Jia Tan's patches
> (we've chatted outside the xz-devel list). Thank you for understanding.

Okay.

Sebastian



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2021-12-04 Thread Sebastian Andrzej Siewior
On 2021-11-30 00:25:11 [+0200], Lasse Collin wrote:
> Hello!
Hello,

> On 2021-02-05 Sebastian Andrzej Siewior wrote:
> > - Added enum `lzma_memlimit_opt' to lzma_stream_decoder_mt() as an
> >   init parameter. The idea is to specify how to obey the memory limit
> >   so the user can keep using one API and not worry to fail due to the
> >   memory limit. Lets assume the archive has a 9MiB dictionary, 24MiB
> >   block of uncompressed data. The archive contains two compressed
> >   blocks of 10 MiB each. Using two threads, the memory requirement is
> >   roughly (9 + 24 + 10) * 2 = 86 MiB
> > 
> >   On a system with 64 MiB of memory with additional 128MiB of swap it
> >   likely leads to the use of (say 30 MiB) swap memory during
> >   decompression which will slow down the whole operation.
> >   The synchronous API would do just fine with only 9 MiB of memory.
> > 
> >   So to not complicate things, invoking lzma_stream_decoder_mt() with
> >   a memory limit of 32 MiB three scenarios are possible:
> >   - LZMA_MEMLIMIT_THREAD
> > One thread requires 43MiB of memory and would exceed the memory
> > limit. However, continue with one thread instead of possible two.
> > 
> >   - LZMA_MEMLIMIT_NO_THREAD
> > One thread requires 43MiB of memory and would exceed the memory
> > limit. Fallback to the synchronous API without buffered input /
> > output memory.
> > 
> >   - LZMA_MEMLIMIT_COMPLETE
> > In this scenario it would behave like LZMA_MEMLIMIT_NO_THREAD.
> > However, with a dictionary size > 32MiB it would abort.
> 
> In the old single-threaded code, if no memory usage limit is specified
> the worst case memory usage with LZMA2 is about 4 GiB (the format
> allows 4 GiB dict although the current encoder only supports 1536 MiB).
> With the threaded decoder it's the same with LZMA_MEMLIMIT_NO_THREAD.
> 
> However, LZMA_MEMLIMIT_THREAD sounds a bit scary. There are no practical
> limits to the block size so there can be a .xz file that makes the
> decoder allocate a huge amount of memory. It doesn't even need to be an
> intentionally malicious file, it just needs to have the size fields
> present. Thus, I think LZMA_MEMLIMIT_THREAD should be removed.
> One-thread multi-threaded mode will still be used with
> LZMA_MEMLIMIT_NO_THREAD if the limit is high enough.

You are right. I can't think of a reason to force one thread.

> LZMA_MEMLIMIT_NO_THREAD should be the default in xz when no memory
> usage limit has been explicitly specified. There needs to be a default
> "soft limit" (the MemAvailable method is such) that will drop xz to
> single-threaded mode if the soft limit is too high for threaded mode
> (even with just one thread).
> 
> LZMA_MEMLIMIT_COMPLETE could be the mode to use when a memlimit is
> explicitly specified (a "hard limit") on the xz command line. This would
> match the existing behavior of the old single-threaded decoder. It
> would be good to have a way to specify a soft limit on the xz command
> line too.
> 
> It could make sense to have both soft and hard limit at the same time
> but perhaps it gets too confusing: Soft limit that would be used to
> restrict the number of threads (and even drop to single-threaded mode)
> and hard limit which can return LZMA_MEMLIMIT_ERROR. If one is fine to
> use 300 MiB in threaded mode but still wants to allow up to 600 MiB in
> case the file *really* requires that much even in single-threaded mode,
> then this would be useful.

I see. This might sounds like advanced usage.
Looking from the package-manager point of view (installs packages and
needs it decompressed), the work flow is "I have here 80MiB of free
memory but I need to have this decompressed so do what needs to be done
in order fulfil your job." So if the dictionary is 128MiB in size, do it
without threading and hope for swap space but this *used* to work before
that, too.

> Separate soft and hard limits might be convenient from implementation
> point of view though. xz would need --memlimit-soft (or some better
> name) which would always have some default value (like MemAvailable).
> The threaded decoder in liblzma would need to take two memlimit values.
> Then there would be no need for an enum (or a flag) to specify the
> memlimit mode (assuming that LZMA_MEMLIMIT_THREAD is removed).

Ah I see. So one would say soft-limit 80MiB, hard-limit 2^60bytes and
would get no threading at all / LZMA_MEMLIMIT_NO_THREAD. And with soft
1GiB, hard 2^60bytes would get the threading mode. (2^60 is made up
no limit).

> Extra idea, maybe useless: The --no-adjust option could be used to
> specify that if the specified number of threads isn't possible due to
>

[xz-devel] Multithreaded decompression for XZ Utils.

2021-11-06 Thread Sebastian Andrzej Siewior
Hi,

just spotted that Christmas is around the corner. I *think* that I've
been a good boy over the year. I plan to keep it that way just to be
sure. Not trying to push my luck here but what are my chances to find
parallel decompression in xz-utils under the christmas tree?

Sebastian



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2021-08-10 Thread Sebastian Andrzej Siewior
On 2021-07-20 02:09:31 [+0200], Guillem Jover wrote:
> Hi!
Hi Guillem,

> On Fri, 2021-02-05 at 00:11:57 +0100, Sebastian Andrzej Siewior wrote:
> > From: Sebastian Andrzej Siewior 
> 
> I've only skimmer very quickly over the patch, but I've been running
> it on my system in addition to a locally modified dpkg that uses this
> new support, and it seems to be working great. :)
> 
>   
> <https://git.hadrons.org/cgit/debian/dpkg/dpkg.git/log/?h=next/xz-mt-decompressor>

Thanks for letting me know. You used LZMA_MEMLIMIT_NO_THREAD. How
lovely. I added that piece exactly to satisfy dpkg's expectations hoping
that a buildd with 64MiB RAM will behave as it did with the single
threaded implementation :)

> Thanks,
> Guillem

Sebastian



Re: [xz-devel] [PATCH v2] liblzma: Add multi-threaded decoder

2021-04-12 Thread Sebastian Andrzej Siewior
On 2021-04-11 21:48:39 [+0300], Lasse Collin wrote:
> 
> Yes, sorry, I didn't read carefully enough.

no worries.

> After the XZ for Java 1.9 release I wanted a short break but it's
> already been a month. Outside the Java version there are multiple XZ
> things to do (not all on xz-devel) and I think I should do a few of the
> small ones first to shorten the list a little before your big threading
> patch. I know my handling of the threading patch may at this point
> appear both rude and ridiculous but I just don't get enough things
> done, I'm really sorry. :-(

Okay. So last time you threw a few .xz files at me which did improve the
decoder logic which was nice. If you have any of these in your cabinet…

Other than that, I keep waiting until you get to it. As I said earlier,
the Debian gates are closed so for now I don't mind waiting.

I don't know if this "query available memory" qualifies for little
thing. It was part the huge patch but it could be split out. There was
also the 3GiB limit for 32bit architectures I asked for compared to the
~4GiB limit which works only on 32bit xz running on 64bit host. The mips
patch that you merged is kind of in the similar corner (although I don't
remember if the fixed 2:2 split counts for all mips all just a specific
architecture).

Sebastian



[xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2021-02-04 Thread Sebastian Andrzej Siewior
From: Sebastian Andrzej Siewior 

Changes since v2:
- use outq

- The block is considered as decompressed after LZMA_STREAM_END (not if
  one of the buffer reach their limit).

- If the in-buffer would produce more than allocated for the out-buffer
  then LZMA_DATA_ERROR is returned.

- If `timeout' is set, the expiry time is computed on function entry and
  not before starting to wait for the thread until it makes progress.

- lzma_outq_enable_partial_output() is used unconditionally.

Changes since v1:
- Use the `timeout' parameter / mythread_cond_timedwait() in the main
  thread.
- A little bit of doxygen documentation.
- Added enum `lzma_memlimit_opt' to lzma_stream_decoder_mt() as an init
  parameter. The idea is to specify how to obey the memory limit so the
  user can keep using one API and not worry to fail due to the memory
  limit. Lets assume the archive has a 9MiB dictionary, 24MiB block of
  uncompressed data. The archive contains two compressed blocks of 10
  MiB each. Using two threads, the memory requirement is roughly
 (9 + 24 + 10) * 2 = 86 MiB

  On a system with 64 MiB of memory with additional 128MiB of swap it
  likely leads to the use of (say 30 MiB) swap memory during
  decompression which will slow down the whole operation.
  The synchronous API would do just fine with only 9 MiB of memory.

  So to not complicate things, invoking lzma_stream_decoder_mt() with a
  memory limit of 32 MiB three scenarios are possible:
  - LZMA_MEMLIMIT_THREAD
One thread requires 43MiB of memory and would exceed the memory
limit. However, continue with one thread instead of possible two.

  - LZMA_MEMLIMIT_NO_THREAD
One thread requires 43MiB of memory and would exceed the memory
limit. Fallback to the synchronous API without buffered input /
output memory.

  - LZMA_MEMLIMIT_COMPLETE
In this scenario it would behave like LZMA_MEMLIMIT_NO_THREAD.
However, with a dictionary size > 32MiB it would abort.

- Add get_free_mem() to get free memory on Linux. It reads the value of
  `MemAvailable' from `/proc/meminfo'. The difference compared to
  `lzma_physmem()' is that it returns the amout of memory that the
  system thinks is usable without using swap memory.
  A big machine may have 128GiB of memory but may also have a few
  applications running consuming 88GiB of it. The `free' command may
  report 38 GiB in page cache and 2 GiB in free. The `MemAvailable' will
  report something between 37 - 39 GiB area.

- New threads are only created if memory limit does not exceed
  filter_size + decomp_block * 2 which accounts the output and
  worst-case input (uncompressed data).
  This is to avoid exceeding the memory limit while extending the input
  buffer.

Changes since RFC:
- Options are considered
- Memory limits are considered. The limit may get exceeded if we get
  close to it and then the existing threads enlarge their in-buffer.
- Blocks with no size information in the header can be decompressed.
  This happens synchronous.

Signed-off-by: Sebastian Andrzej Siewior 
---
 src/liblzma/api/lzma/container.h   |   79 +-
 src/liblzma/common/Makefile.inc|6 +
 src/liblzma/common/outqueue.h  |9 +
 src/liblzma/common/stream_decoder_mt.c | 1157 
 src/liblzma/liblzma.map|1 +
 src/xz/coder.c |  110 ++-
 6 files changed, 1357 insertions(+), 5 deletions(-)
 create mode 100644 src/liblzma/common/stream_decoder_mt.c

diff --git a/src/liblzma/api/lzma/container.h b/src/liblzma/api/lzma/container.h
index 4ad7ce6a8ec0d..505aa3c16ffc4 100644
--- a/src/liblzma/api/lzma/container.h
+++ b/src/liblzma/api/lzma/container.h
@@ -173,7 +173,7 @@ typedef struct {
uint32_t reserved_int2;
uint32_t reserved_int3;
uint32_t reserved_int4;
-   uint64_t reserved_int5;
+   uint64_t memlimit;
uint64_t reserved_int6;
uint64_t reserved_int7;
uint64_t reserved_int8;
@@ -587,6 +587,83 @@ extern LZMA_API(lzma_ret) lzma_stream_decoder(
lzma_nothrow lzma_attr_warn_unused_result;
 
 
+/**
+ * \brief   The `memory_limit' argument for lzma_stream_decoder_mt()
+ *
+ * The required memory limit for decompression is determined later once the XZ
+ * block header is parsed and the compression parameters are known.
+ * Each thread needs to allocate the memory for the filters (size of the
+ * dictionary) and a buffer for the compressed and decompressed data.
+ * This option specifies how to proceed if is not possible to continue even 
with
+ * one thread.
+ */
+typedef enum {
+
+
+   LZMA_MEMLIMIT_COMPLETE = 0,
+   /**<
+* \brief   Abort decompression.
+*
+* Decompression will fail if the specified memory limit would be
+* exceeded even with one thread.
+*/
+
+
+   LZMA_MEMLIMIT_THREAD = 1,
+   /**<
+* \brief   Continue with one thread.
+*
+* Should the requir

[xz-devel] [PATCH] partial update, end with LZMA_STREAM_END error

2021-01-29 Thread Sebastian Andrzej Siewior
From: Sebastian Andrzej Siewior 

This is an incremental update against the outq patch. I can send an
all-in-one patch against the current master branch, I hope this makes
the review easier.

I wanted to address all issues raised in the previous mail. The "no
progress" state is hopefully properly addressed. I added `tmp_buf' which
is used if the output buffer is exhausted. Should the additional buffer
be used then it is considered as LZMA_DATA_ERROR. if the block decoder
returns with LZMA_OK. I think bad-1-lzma2-10.xz tried this.

Signed-off-by: Sebastian Andrzej Siewior 
---
 src/liblzma/common/stream_decoder_mt.c | 128 +++--
 1 file changed, 78 insertions(+), 50 deletions(-)

diff --git a/src/liblzma/common/stream_decoder_mt.c 
b/src/liblzma/common/stream_decoder_mt.c
index 8e361c38f27a2..bbe22c461e52f 100644
--- a/src/liblzma/common/stream_decoder_mt.c
+++ b/src/liblzma/common/stream_decoder_mt.c
@@ -45,7 +45,6 @@ struct worker_thread {
/// Bytes consumed of ->in (worker)
size_t in_pos;
 
-   lzma_ret thread_error;
worker_state state;
 
/// Pointer to the main structure is needed when putting this
@@ -98,6 +97,7 @@ struct lzma_stream_coder {
 
/// Current thread compressed data is written to
struct worker_thread *thr_write;
+   lzma_ret thread_error;
 
lzma_outq outq;
 
@@ -142,6 +142,23 @@ static void thr_do_partial_update(void *thr_ptr)
mythread_mutex_unlock(>mutex);
 }
 
+static void worker_set_error(struct worker_thread *thr, lzma_ret err_code)
+{
+   mythread_mutex_lock(>mutex);
+   if (thr->state == THR_RUN)
+   thr->state = THR_IDLE;
+   mythread_mutex_unlock(>mutex);
+
+   mythread_mutex_lock(>coder->mutex);
+   if (thr->coder->thread_error == LZMA_OK)
+   thr->coder->thread_error = err_code;
+
+   thr->next = thr->coder->threads_free;
+   thr->coder->threads_free = thr;
+   mythread_cond_signal(>coder->cond);
+   mythread_mutex_unlock(>coder->mutex);
+}
+
 /// Use smaller chunks so cancellation attempts don't block for long
 #define CHUNK_SIZE 16384
 static MYTHREAD_RET_TYPE worker_decoder(void *thr_ptr)
@@ -149,6 +166,8 @@ static MYTHREAD_RET_TYPE worker_decoder(void *thr_ptr)
struct worker_thread *thr = thr_ptr;
size_t in_filled;
size_t out_pos;
+   unsigned char tmp_buf;
+   size_t tmp_buf_pos = 0;
lzma_ret ret;
 
 next_loop_lock:
@@ -176,12 +195,10 @@ static MYTHREAD_RET_TYPE worker_decoder(void *thr_ptr)
goto next_loop_unlocked;
} else if (thr->state != THR_RUN) {
thr->state = THR_IDLE;
-   thr->thread_error = LZMA_PROG_ERROR;
mythread_mutex_unlock(>mutex);
 
-   mythread_mutex_lock(>coder->mutex);
-   mythread_cond_signal(>coder->cond);
-   mythread_mutex_unlock(>coder->mutex);
+   worker_set_error(thr, LZMA_PROG_ERROR);
+
goto next_loop_lock;
}
 
@@ -199,60 +216,74 @@ static MYTHREAD_RET_TYPE worker_decoder(void *thr_ptr)
 
out_pos = thr->secret_progress;
 
-   ret = thr->block_decoder.code(thr->block_decoder.coder,
- thr->allocator,
- thr->in, >in_pos, in_filled,
- thr->outbuf->buf, _pos, 
thr->outbuf->allocated,
- LZMA_RUN);
-   if (ret == LZMA_OK || ret == LZMA_STREAM_END) {
+   // Check if it attempts to write more than written in the header.
+   if (out_pos == thr->outbuf->allocated) {
+
+   ret = thr->block_decoder.code(thr->block_decoder.coder,
+ thr->allocator,
+ thr->in, >in_pos, in_filled,
+ _buf, _buf_pos, 1,
+ LZMA_RUN);
+   } else {
+   ret = thr->block_decoder.code(thr->block_decoder.coder,
+ thr->allocator,
+ thr->in, >in_pos, in_filled,
+ thr->outbuf->buf, _pos, 
thr->outbuf->allocated,
+ LZMA_RUN);
+   }
+   if (ret == LZMA_OK) {
bool partial_update;
 
mythread_mutex_lock(>mutex);
-   if (thr->in_pos == thr->in_block_size) {
-   if (thr->state == THR_RUN)
-   thr->state = THR_IDLE;
-   }
partial_update = thr->partial_update;
mythread_mutex_unlock(>mute

Re: [xz-devel] [PATCH v2] liblzma: Add multi-threaded decoder

2021-01-27 Thread Sebastian Andrzej Siewior
On 2021-01-24 23:56:15 [+0200], Lasse Collin wrote:
> Hello!
Hi,

> I haven't made much progress with this still, I'm sorry. :-( Below are
> comments about a few small details. It's not much but I will (slowly)
> keep reading and testing.

Thank you.

> (1) Segfault due to thr->outbuf == NULL
> 
> I changed CHUNK_SIZE to 1 to test corner cases. I used
> good-1-block_header-1.xz as the test file. It can segfault in
> worker_decoder() on the line calling thr->block_decoder.code(...)
> because thr->outbuf is NULL (so the problem was introduced in the outq
> patch). This happens because of "thr->outbuf = NULL;" later in the
> function.
> 
> It looks like that it marks the outbuf finished and returns the thread
> to the pool too early or forgets to set thr->state = THR_IDLE. As a
> temporary workaround, I added "thr->state = THR_IDLE;" after
> "thr->outbuf = NULL;".

I moved the logic to reset the in buffer only after LZMA_STREAM_END.

> (2) Block decoder must return LZMA_STREAM_END on success
> 
> Because of end marker and integrity check, the output buffer will be
> full before the last bytes of input have been processed by the Block
> decoder. Thus it is not enough to look at the input and output
> positions to determine when decoding has been finished; only
> LZMA_STREAM_END should be used to determine that decoding was
> successful.
> 
> In theory it is OK to mark the outbuf as finished once the output is
> full but for simplicity I suggest doing so (and returning the thread to
> the pool) only after LZMA_STREAM_END.
> 
> I committed a new test file bad-1-check-crc32-2.xz. The last byte in
> the Block (last byte of Check) is wrong. Change CHUNK_SIZE to 1 and try
> "xz -t -T2 file bad-1-check-crc32-2.xz". The file must be detected to
> be corrupt (LZMA_DATA_ERROR).
> 

I changed that as suggested.

> (3) Bad input where the whole input or output buffer cannot be used
> 
> In the old single-threaded decoding, lzma_code() will eventually return
> LZMA_BUF_ERROR if the calls to lzma_code() cannot make any progress,
> that is, no more input is consumed and no more output is produced. This
> condition can happen with correct code if the input file is corrupt in
> a certain way, for example, a truncated .xz file.
> 
> Since the no-progress detection is centralized in lzma_code(), the
> internal decoders including Block decoder don't try to detect this
> situation. Currently this means that worker_decoder() should detect it
> to catch bad input and prevent hanging on certain malformed Blocks.
> However, since the Block decoder knows both Compressed Size and
> Uncompressed Size, I think I will improve Block decoder instead so
> don't do anything about this for now.
> 
> I committed two test files, bad-1-lzma2-9.xz and bad-1-lzma2-10.xz. The
> -9 may make worker_decoder() not notice that the Block is invalid. The
> -10 makes the decoder hang. Like I said, I might fix these by changing
> the Block decoder.

So I updated the logic that if LZMA_OK is returned from the block
decoder and either the complete in-buffer or out-buffer has been
consumed then it returns LZMA_DATA_ERROR.

> (4) Usage of partial_update in worker_decoder()
> 
> Terminology: main mutex means coder->mutex alias thr->coder->mutex.
> 
> In worker_decoder(), the main mutex is locked every time there is new
> output available in the worker thread. partial_update is only used to
> determine when to signal thr->coder->cond.
> 
> To reduce contention on the main mutex, worker_decoder() could lock it
> only when
>   - decoding of the Block has been finished (successfully or
> unsuccessfully, that is, ret != LZMA_OK), or
>   - there is new output available and partial_update is true; if
> partial_update is false, thr->outbuf->pos is not touched.
> 
> This way only one worker will be frequently locking the main mutex.
> However, I haven't tried it and thus don't know how much this affects
> performance in practice. One possible problem might be that it may
> introduce a small delay in output availability when the main thread
> switches reading from the next outbuf in the list.

I did implement it as described, didn't I?

> (5) Use of mythread_condtime_set()
> 
> In the encoder the absolute time is calculated once per lzma_code()
> call. The comment in wait_for_work() in in stream_encoder_mt.c was
> wrong. The reason the absolute time is calculated once per lzma_code()
> call is to ensure that blocking multiple times won't make the timeout
> ineffective if each blocking takes less than timeout milliseconds. So
> it should be done similarly in the decoder.

oh, okay. I assumed it as in "waiting is okay but no longer than timeout
secs".
It is possible that at SEQ_INDEX time we wait less than `timeout' copy a
few bytes and then wait again less than `timeout' until the out-buffer
is full. But hey, we make progress ;)
At SEQ_BLOCK_HEADER we would wait only once and return on the second
iteration because we made progress. But yes, depending on how much has
been 

Re: [xz-devel] [RFC 2/2] Add xxHash, XX3 (128bit) for hashing.

2021-01-20 Thread Sebastian Andrzej Siewior
On 2021-01-20 00:37:06 [+0100], Sebastian Andrzej Siewior wrote:
> So this is better than crc64 and close to none while doing something ;)

xz -tv -T0 with crc64 reports:

  100 % 10,2 GiB / 40,0 GiB = 0,255   1,1 GiB/s   0:35

and the same archive with xxh3:

  100 % 10,2 GiB / 40,0 GiB = 0,255   1,1 GiB/s   0:34

which looks like it is not worth the trouble.

Sebastian



Re: [xz-devel] [PATCH 1/2] Add support openssl's SHA256 implementation

2021-01-20 Thread Sebastian Andrzej Siewior
On 2021-01-20 00:30:08 [+0100], Sebastian Andrzej Siewior wrote:
> From: Sebastian Andrzej Siewior 
> 
> I created a test file via
>   dd if=/dev/zero bs=1024k count=1024 | xz -v -0 -Csha256

Another data point: xz -tv -T0 went from
   100 % 10,2 GiB / 40,0 GiB = 0,255   921 MiB/s   0:44
to
   100 % 10,2 GiB / 40,0 GiB = 0,255   1,2 GiB/s   0:34   

(I converted my test archive from crc64 to sha256 for testing).

Sebastian



[xz-devel] [RFC 2/2] Add xxHash, XX3 (128bit) for hashing.

2021-01-19 Thread Sebastian Andrzej Siewior
From: Sebastian Andrzej Siewior 

After seeing the numbers of of openssl's sha256 vs intree, vs crc64 vs
none I decided to look at xxHash 16byte version which got stable in the
0.8 version (given I understood the signs right):

| Performance counter stats for './src/xz/.libs/xz -t xxh3.xz' (5 runs):
|
| 8.133.583.414  cycles#4,293 GHz   
   ( +-  0,01% )  (83,22%)
|14.366.854.241  instructions  #1,77  insn per cycle
|
|1,8958 +- 0,0135 seconds time elapsed  ( +-  0,71% )

So this is better than crc64 and close to none while doing something ;)

Signed-off-by: Sebastian Andrzej Siewior 
---
 configure.ac |  6 +-
 src/liblzma/Makefile.am  |  2 +-
 src/liblzma/api/lzma/check.h |  7 +++
 src/liblzma/check/check.c| 27 +--
 src/liblzma/check/check.h| 29 +++--
 src/lzmainfo/Makefile.am |  2 +-
 src/xz/Makefile.am   |  2 +-
 src/xz/args.c|  1 +
 src/xz/list.c|  2 +-
 src/xz/message.c |  2 +-
 src/xzdec/Makefile.am|  2 +-
 11 files changed, 71 insertions(+), 11 deletions(-)

diff --git a/configure.ac b/configure.ac
index 5e0eaefc99c92..2e2bfe113da56 100644
--- a/configure.ac
+++ b/configure.ac
@@ -234,7 +234,7 @@ fi
 # Integrity checks #
 
 
-m4_define([SUPPORTED_CHECKS], [crc32,crc64,sha256])
+m4_define([SUPPORTED_CHECKS], [crc32,crc64,xxh3,sha256])
 
 m4_foreach([NAME], [SUPPORTED_CHECKS],
 [enable_check_[]NAME=no
@@ -798,6 +798,10 @@ if test "x$enable_openssl$openssl_found" = xyesno; then
AC_MSG_ERROR([--enable-openssl was specified but openssl was not 
found.])
 fi
 
+if test "x$enable_check_xxh" = "xyes"; then
+   PKG_CHECK_MODULES([LIBXXHASH], [libxxhash >= 0.8.0])
+fi
+
 # Check for SSE2 intrinsics.
 AC_CHECK_DECL([_mm_movemask_epi8],
[AC_DEFINE([HAVE__MM_MOVEMASK_EPI8], [1],
diff --git a/src/liblzma/Makefile.am b/src/liblzma/Makefile.am
index 3afb08169840c..570a4db23c3f4 100644
--- a/src/liblzma/Makefile.am
+++ b/src/liblzma/Makefile.am
@@ -25,7 +25,7 @@ liblzma_la_CPPFLAGS = \
-I$(top_srcdir)/src/common \
-DTUKLIB_SYMBOL_PREFIX=lzma_
 liblzma_la_LDFLAGS = -no-undefined -version-info 8:99:3
-liblzma_la_LDFLAGS += $(OPENSSL_CRYPTO_LIBS)
+liblzma_la_LDFLAGS += $(OPENSSL_CRYPTO_LIBS) $(LIBXXHASH_LIBS)
 
 EXTRA_DIST += liblzma.map validate_map.sh
 if COND_SYMVERS
diff --git a/src/liblzma/api/lzma/check.h b/src/liblzma/api/lzma/check.h
index 6a243db0d7943..21aa6c8e3f7c9 100644
--- a/src/liblzma/api/lzma/check.h
+++ b/src/liblzma/api/lzma/check.h
@@ -46,6 +46,13 @@ typedef enum {
 * Size of the Check field: 8 bytes
 */
 
+   LZMA_CHECK_XXH3   = 7,
+   /**<
+* xxHash family, XXH3, 128bit
+*
+* Size of the Check field: 16 bytes
+*/
+
LZMA_CHECK_SHA256   = 10
/**<
 * SHA-256
diff --git a/src/liblzma/check/check.c b/src/liblzma/check/check.c
index 428ddaeb77981..2e168ae274c1f 100644
--- a/src/liblzma/check/check.c
+++ b/src/liblzma/check/check.c
@@ -39,7 +39,13 @@ lzma_check_is_supported(lzma_check type)
 
false,  // Reserved
false,  // Reserved
-   false,  // Reserved
+
+#ifdef HAVE_CHECK_XXH3
+   true,
+#else
+   false,
+#endif
+
false,  // Reserved
false,  // Reserved
 
@@ -48,7 +54,6 @@ lzma_check_is_supported(lzma_check type)
 #else
false,
 #endif
-
false,  // Reserved
false,  // Reserved
false,  // Reserved
@@ -99,6 +104,12 @@ lzma_check_init(lzma_check_state *check, lzma_check type)
break;
 #endif
 
+#ifdef HAVE_CHECK_XXH3
+   case LZMA_CHECK_XXH3:
+   lzma_xxh3_init(check);
+   break;
+#endif
+
 #ifdef HAVE_CHECK_SHA256
case LZMA_CHECK_SHA256:
lzma_sha256_init(check);
@@ -130,6 +141,12 @@ lzma_check_update(lzma_check_state *check, lzma_check type,
break;
 #endif
 
+#ifdef HAVE_CHECK_XXH3
+   case LZMA_CHECK_XXH3:
+   lzma_xxh3_update(buf, size, check);
+   break;
+#endif
+
 #ifdef HAVE_CHECK_SHA256
case LZMA_CHECK_SHA256:
lzma_sha256_update(buf, size, check);
@@ -160,6 +177,12 @@ lzma_check_finish(lzma_check_state *check, lzma_check type)
break;
 #endif
 
+#ifdef HAVE_CHECK_XXH3
+   case LZMA_CHECK_XXH3:
+   lzma_xxh3_finish(check);
+   break;
+#endif
+
 #ifdef HAVE_CHECK_SHA256
case LZMA_CHECK_SHA256:
lzma_sha256_finish(check);
diff --git a/src/liblzma/check/check.h b/src/liblzma/check/check.h
index 0249025ec179a..910dc3d55fdca 100644
--- a/src/liblzma/check/check.h
+++ b/sr

[xz-devel] [PATCH 1/2] Add support openssl's SHA256 implementation

2021-01-19 Thread Sebastian Andrzej Siewior
From: Sebastian Andrzej Siewior 

I created a test file via
dd if=/dev/zero bs=1024k count=1024 | xz -v -0 -Csha256

and compared the in-tree sha256 implementation on a Ryzen (CPU
acceleration available):

| Performance counter stats for 'xz --test sha256.xz' (5 runs):
|
|20.748.708.638  cycles#4,174 GHz   
   ( +-  1,23% )  (83,29%)
|63.371.432.190  instructions  #3,05  insn per cycle
|  #0,23  stalled cycles 
per insn  ( +-  0,01% )  (83,37%)
|4,9778 +- 0,0488 seconds time elapsed  ( +-  0,98% )

vs OpenSSL's:

| Performance counter stats for './src/xz/xz --test sha256.xz' (5 runs):
|
|10.037.180.776  cycles#4,230 GHz   
   ( +-  0,03% )  (83,18%)
|16.126.619.033  instructions  #1,61  insn per cycle
|  #0,50  stalled cycles 
per insn  ( +-  0,01% )  (83,43%)
|   2,37200 +- 0,00621 seconds time elapsed  ( +-  0,26% )

worse insn/cycle ratio, much less instructions half run time. It is
even slightly better compared to crc64:

| Performance counter stats for './src/xz/xz --test crc64.xz' (5 runs):
|
|10.989.495.452  cycles#4,250 GHz   
   ( +-  0,04% )  (83,22%)
|17.829.100.301  instructions  #1,62  insn per cycle
|  #0,43  stalled cycles 
per insn  ( +-  0,02% )  (83,42%)
|2,5850 +- 0,0103 seconds time elapsed  ( +-  0,40% )

For the protocol, compared to no checksum:

| Performance counter stats for './src/xz/xz --test none.xz' (5 runs):
|
| 7.857.471.590  cycles#4,237 GHz   
   ( +-  0,03% )  (83,08%)
|13.257.837.157  instructions  #1,69  insn per cycle
|
|   1,85337 +- 0,00440 seconds time elapsed  ( +-  0,24% )

Signed-off-by: Sebastian Andrzej Siewior 
---

I learned here that rpm is using sha256 based checksums. So that might
be a good thing.

 configure.ac  | 24 +++-
 src/liblzma/Makefile.am   |  1 +
 src/liblzma/check/check.h | 33 -
 src/lzmainfo/Makefile.am  |  2 +-
 src/xz/Makefile.am|  2 +-
 src/xzdec/Makefile.am |  2 +-
 6 files changed, 59 insertions(+), 5 deletions(-)

diff --git a/configure.ac b/configure.ac
index 2418e4b039e61..5e0eaefc99c92 100644
--- a/configure.ac
+++ b/configure.ac
@@ -289,6 +289,19 @@ else
AC_MSG_RESULT([no])
 fi
 
+AC_MSG_CHECKING([if openssl should be used])
+AC_ARG_ENABLE([openssl], AS_HELP_STRING([--enable-openssl],
+   [Use openssl from the operating system.
+   See INSTALL for possible subtle problems.]),
+   [], [enable_openssl=no])
+if test "x$enable_openssl" != "xyes"; then
+   enable_openssl=no
+fi
+if test "x$enable_openssl" = xyes; then
+   AC_MSG_RESULT([yes])
+else
+   AC_MSG_RESULT([no])
+fi
 
 ###
 # Assembler optimizations #
@@ -740,6 +753,7 @@ TUKLIB_MBSTR
 sha256_header_found=no
 sha256_type_found=no
 sha256_func_found=no
+openssl_found=no
 if test "x$enable_external_sha256" = "xyes"; then
# Test for Common Crypto before others, because Darwin has sha256.h
# too and we don't want to use that, because on older versions it
@@ -770,11 +784,19 @@ if test "x$enable_external_sha256" = "xyes"; then
[sha256_func_found=yes ; break])
fi
fi
+elif test "x$enable_openssl" = "xyes"; then
+   PKG_CHECK_MODULES([OPENSSL_CRYPTO], [libcrypto],
+ [AC_DEFINE([HAVE_OPENSSL_CRYPTO], [1], [Use SHA256 
from openssl])
+ openssl_found=yes])
 fi
-AM_CONDITIONAL([COND_INTERNAL_SHA256], [test "x$sha256_func_found" = xno])
+
+AM_CONDITIONAL([COND_INTERNAL_SHA256], [test "x$sha256_func_found" = xno -a 
"x$openssl_found" = xno])
 if test "x$enable_external_sha256$sha256_func_found" = xyesno; then
AC_MSG_ERROR([--enable-external-sha256 was specified but no supported 
external SHA-256 implementation was found])
 fi
+if test "x$enable_openssl$openssl_found" = xyesno; then
+   AC_MSG_ERROR([--enable-openssl was specified but openssl was not 
found.])
+fi
 
 # Check for SSE2 intrinsics.
 AC_CHECK_DECL([_mm_movemask_epi8],
diff --git a/src/liblzma/Makefile.am b/src/liblzma/Makefile.am
index 6323e26aade10..3afb08169840c 100644
--- a/src/liblzma/Makefile.am
+++ b/src/liblzma/Makefile.am
@@ -25,6 +25,7 @@ liblzma_la_CPPFLAGS = \
-I$(top_srcdir)/src/common \
-DTUKLIB_SYMBOL_PREFIX=lzma_
 liblzma_la_LDFLAGS = -no-undefined -version-info 8:99:3
+liblzma_la_LDFLAGS += $(O

Re: [xz-devel] [PATCH] xz: Fix setting memory limit on 32-bit systems

2021-01-19 Thread Sebastian Andrzej Siewior
On 2021-01-18 23:52:50 [+0200], Lasse Collin wrote:
> On 2021-01-10 Sebastian Andrzej Siewior wrote:
> > I hope for sane defaults :)
> 
> I hope so too. So far I have felt that the suggested solutions have
> significant flaws or downsides, and I'm not able to see what is a good
> enough compromise. As a result the discussion hasn't progressed much
> and I feel it's partly my fault, sorry. I will try again:
> 
> I have understood that *in practice* the problem with the xz command
> line tool is limited to "xz -T0" usage so fixing this use case is
> enough for most people. Please correct me if I missed something.

Correct.

> The change in XZ Utils 5.2.5 helps a little with 32-bit xz running
> under 64-bit kernel but only if one specifies a memory usage limit like
> -M90% together with -T0. To make plain -T0 work too, in an earlier
> email I suggested that -T0 could also imply a memory usage limit if no
> limit was otherwise specified (a preliminary patch was included too). I
> have been hesitant to make changes to the defaults of the memory usage
> limiter but this solution would only affect a very specific situation
> and thus I feel it would be fine. Comments would be appreciated.

In the parallel decompress I added code on Linux to query the
available memory. I would prefer that as an upper limit on 64bit if no
limit is given. The reason is that *this* amount of memory is safe to
use without over-committing / involving swap.
For 32bit applications I would cap that limit to 2.5 GiB or so. The
reason is that the *normal* case is to run 32bit application on a 32bit
kernel and so likely only 3GiB can be addressed at most (minus a few
details like linked in libs, NULL page, guard pages and so on).
The 32bit application on 64bit kernel is probably a shortcut where
something is done a 32bit chroot - like building a package.

I'm not sure what a sane upper limit is on other OSes. Limitting it on
32bit does probably more good than bad if there is no -M parameter.

> The problem with applications using liblzma and running out of address
> space sounds harder to fix. As I explained in another email, making
> liblzma more robust with memory allocation failures is not a perfect
> fix and can still result in severe problems depending on how the
> application as a whole works (with some apps it could be enough).

Yes. For liblzma you get the memory limitation from the caller. I've
seen in Debian's dpkg to use physmem/2 with 2GiB as upper limit on
32bit. That works ;)

> An alternative "fix" for the liblzma case could be adding a simple API
> function that would scale down the number of threads in a lzma_mt
> structure based on a memory usage limit and if the application is 32
> bits. Currently the thread count and LZMA2 settings adjusting code is
> in xz, not in liblzma.

It might help. dpkg checks the memlimit with
lzma_stream_encoder_mt_memusage() and decreases the memory limit until
it fits. It looks simpler compared to rpm's attempt and various
exceptions.

> > Anyway. Not to overcompilcate things: On Linux you can obtain the
> > available system memory which I would cap to 2 or 2.5 GiB by default.
> > Nobody should be hurt by that.
> 
> If full 4 GiB of address space is available, capping to 2 GiB to 2.5 GiB
> when the available memory isn't known would mean fewer threads than
> with the 4020 MiB limit. Obviously this is less bad than failing due to
> running out of address space but it still makes me feel that if
> available memory is used on Linux, it should be ported to other OSes
> too.

I didn't understand that last sentance.

> The idea for the current 4020 MiB special limit is based on a patch
> that was in use in FreeBSD to solve the problem of 32-bit xz on 64-bit
> kernel. So at least FreeBSD should be supported to not make 32-bit xz
> worse under 64-bit FreeBSD kernel.

Is this a common case? 
While poking around, Linux has this personality() syscall/function.
There is a flag called PER_LINUX32_3GB and PER_LINUX_32BIT which are set
if the command is invoked with `linux32' say
linux32 xz

then it would set that flag set and could act. It is not set by starting
a 32bit application on a 64bit kernel on its own or on a 32bit kernel.
I don't know if this is common practise but I use this in my chroots. So
commands like `uname -m' return `i686' instead of `x86_64'.
If other chroot environments do it as well then it could be used as a
hack to assume that it is run on 64bit kernel. That is if we want that
ofcourse :)

> In liblzma, if a new function is added to reduce the thread count based
> on a memory usage limit, a capping the limit to 2 to 3 GiB on 32-bit
> applications could be fine even if there is more available memory. Being
> conservative means fewer threads but it would make it more likely that
> things keep working i

[xz-devel] [PATCH] Use outq

2021-01-16 Thread Sebastian Andrzej Siewior
From: Sebastian Andrzej Siewior 

This is against the MT decoder to use the outq instead the self cooked
one buffer/thread to illustrate what had to be done.
It appears to work.

The numbers changed from
  xz -tv
   100 % 10,2 GiB / 40,0 GiB = 0,255   958 MiB/s   0:42
  xz -dv | openssl sha1
   100 % 10,2 GiB / 40,0 GiB = 0,255   813 MiB/s   0:50

to
  xz -tv
   100 % 10,2 GiB / 40,0 GiB = 0,255   1,1 GiB/s   0:36
  xz -dv | openssl sha1
   100 % 10,2 GiB / 40,0 GiB = 0,255   914 MiB/s   0:44

Signed-off-by: Sebastian Andrzej Siewior 
---
 src/liblzma/common/outqueue.h  |   9 +
 src/liblzma/common/stream_decoder_mt.c | 277 +
 2 files changed, 148 insertions(+), 138 deletions(-)

diff --git a/src/liblzma/common/outqueue.h b/src/liblzma/common/outqueue.h
index 355e0ced2cfc3..de630ee855c6b 100644
--- a/src/liblzma/common/outqueue.h
+++ b/src/liblzma/common/outqueue.h
@@ -203,6 +203,15 @@ lzma_outq_has_buf(const lzma_outq *outq)
return outq->bufs_in_use < outq->bufs_limit;
 }
 
+/// \brief  Test if there is at least one preallocated buffer free
+///
+/// This returns true then a new buffer will be pre-allocated.
+///
+static inline bool
+lzma_outq_has_buf_prealloc(const lzma_outq *outq)
+{
+   return outq->bufs_in_use < outq->bufs_allocated;
+}
 
 /// \brief  Test if the queue is completely empty
 static inline bool
diff --git a/src/liblzma/common/stream_decoder_mt.c 
b/src/liblzma/common/stream_decoder_mt.c
index 1bfd2279c176b..8e361c38f27a2 100644
--- a/src/liblzma/common/stream_decoder_mt.c
+++ b/src/liblzma/common/stream_decoder_mt.c
@@ -14,6 +14,7 @@
 #include "block_decoder.h"
 #include "stream_decoder.h"
 #include "index.h"
+#include "outqueue.h"
 
 #include 
 
@@ -33,16 +34,6 @@ typedef enum {
 
 } worker_state;
 
-struct out_buffer {
-   uint8_t *out;
-   /// Size of ->out
-   size_t out_block_size;
-   /// Bytes written to ->out (worker)
-   size_t out_pos;
-   /// Bytes consumed of ->out (coordinator)
-   size_t out_filled;
-};
-
 struct worker_thread {
uint8_t *in;
/// Size of ->in
@@ -63,7 +54,9 @@ struct worker_thread {
/// The allocator is set by the main thread.
const lzma_allocator *allocator;
 
-   struct out_buffer out;
+   lzma_outbuf *outbuf;
+   bool partial_update;
+   size_t secret_progress;
 
lzma_next_coder block_decoder;
lzma_block block_options;
@@ -102,12 +95,12 @@ struct lzma_stream_coder {
/// are created only when actually needed.
struct worker_thread *threads_free;
/// Current thread decompressed is read from
-   struct worker_thread *thr_read;
-   /// Last read thread, used for ->next assignment
-   struct worker_thread *thr_read_last;
+
/// Current thread compressed data is written to
struct worker_thread *thr_write;
 
+   lzma_outq outq;
+
/// Memory usage limit
uint64_t memlimit;
/// Amount of memory actually needed (only an estimate)
@@ -140,6 +133,15 @@ struct lzma_stream_coder {
uint8_t buffer[LZMA_BLOCK_HEADER_SIZE_MAX];
 };
 
+static void thr_do_partial_update(void *thr_ptr)
+{
+   struct worker_thread *thr = thr_ptr;
+
+   mythread_mutex_lock(>mutex);
+   thr->partial_update = true;
+   mythread_mutex_unlock(>mutex);
+}
+
 /// Use smaller chunks so cancellation attempts don't block for long
 #define CHUNK_SIZE 16384
 static MYTHREAD_RET_TYPE worker_decoder(void *thr_ptr)
@@ -148,7 +150,6 @@ static MYTHREAD_RET_TYPE worker_decoder(void *thr_ptr)
size_t in_filled;
size_t out_pos;
lzma_ret ret;
-   struct out_buffer *out;
 
 next_loop_lock:
 
@@ -162,7 +163,7 @@ static MYTHREAD_RET_TYPE worker_decoder(void *thr_ptr)
mythread_mutex_unlock(>mutex);
 
lzma_free(thr->in, thr->allocator);
-   lzma_free(thr->out.out, thr->allocator);
+
lzma_next_end(>block_decoder, thr->allocator);
 
mythread_mutex_destroy(>mutex);
@@ -190,35 +191,57 @@ static MYTHREAD_RET_TYPE worker_decoder(void *thr_ptr)
mythread_cond_wait(>cond, >mutex);
goto next_loop_unlocked;
}
-   out = >out;
+
mythread_mutex_unlock(>mutex);
 
if ((in_filled - thr->in_pos) > CHUNK_SIZE)
in_filled = thr->in_pos + CHUNK_SIZE;
 
-   out_pos = out->out_pos;
+   out_pos = thr->secret_progress;
+
ret = thr->block_decoder.code(thr->block_decoder.coder,
  thr->allocator,
  thr->in, >in_pos, in_filled,
- out->out, _pos, out->out_block_size,
+ 

Re: [xz-devel] [PATCH v2] liblzma: Add multi-threaded decoder

2021-01-10 Thread Sebastian Andrzej Siewior
On 2021-01-09 22:21:23 [+0200], Lasse Collin wrote:
> Hello!
Hi,

> This sounds great but unfortunately I still haven't been able to
> properly read it yet. It may take a few more days. I apologize. :-(

no need to apologize.

> As a minor update from my side, I committed lzma_outq changes that I
> mostly did two weeks ago. I believe it should now be usable for
> threaded decompression too to decouple output buffers from threads.
> However, I don't mean that you must use it. If you wish to use it, it's
> OK to do the change later.

Let me look into that updated lzma_outq.

Sebastian



Re: [xz-devel] [PATCH v2] xz: Ignore hard link count if not deleting

2021-01-10 Thread Sebastian Andrzej Siewior
On 2021-01-09 18:21:45 [+0200], Lasse Collin wrote:
> > Ignore hard link count on input.
> 
> I'm fine with the patch since the suggested behavior makes sense to me
> too. There's always a risk that some specific use case can break due to
> this change but I guess that is acceptably low risk.
> 
> I wonder if it were more logical if --keep also allowed processing of
> files with setuid, setgid, or sticky bit set. The same reasoning
> applies to all these cases, I suppose, and they are all within the same
> if-block in the code so it would still be a single-line change.
> 
> What about symlinks or files that aren't regular files (like block
> devices)? I guess requiring --force for non-regular files makes sense
> still but perhaps symlinks that point to regular files are OK to
> process with --keep.
> 
> Any thoughts on this patch?

Yes, I think it makes sense. Following the symlink makes sense but
removing the symlink is different from removing a file and since it is
not removed it could be ignored.

Sebastian



Re: [xz-devel] [PATCH] xz: Fix setting memory limit on 32-bit systems

2021-01-10 Thread Sebastian Andrzej Siewior
On 2021-01-08 18:40:18 [+0200], Lasse Collin wrote:
> Hello!
Hi,

> Sorry for the two-week silence, I was ill. It will take a few days for
> me to catch up with the emails.

No worries. Take your time to get better.

> On 2020-12-26 Sebastian Andrzej Siewior wrote:
> > On 2020-12-26 09:33:04 [+0300], Vitaly Chikunov wrote:
> > > This wasn't working, because `memlimit_compress` initialized with
> > > zero, thus memory limit is never lowered for 32-bit address space,
> > > causing `Cannot allocate memory' error (in `lzma_outq_init()'). For
> > > example, when `-T0' is used on 32 CPUs with compression level
> > > higher than `-6'.  
> > 
> > That is one way. It might be that hardware_init() should pass
> > `total_ram' to hardware_memlimit_set() instead of 0.
> > hardware_memlimit_get() treats 0 as unlimited but I don't think it
> > makes sense since memory is never unlimited.
> 
> 0 means disabled, that is, xz is expected to behave just like most
> other programs that might allocate a lot of memory but don't have any
> internal memory usage limiting. Memory isn't unlimited but many
> programs sort of behave as if it were and fail hard if allocation
> fails. That's not robust but it seems to work most of the time and
> many seem find this to be acceptable behavior in general.

I hope for sane defaults :)

> The whole limiter feature exist because I felt it was good to have a
> mechanism to control the memory usage, especially when decompressing
> since a .xz file may cause xz to allocate 4 GiB of memory for a single
> thread. However, I think few people think the same and thus the limiter
> is off by default for both compression and decompression.

The memory limiter sounds reasonable - no doubts.

> > Also, 32bit with almost 4GiB as a limit is working. If you increase
> > your input (the example from your previous email) then you also end
> > up "can not allocate memory) simply because 32bit can not allocate
> > 4GiB of memory. I'm not sure if the actual memory limit is exported.
> > It is usually at around 3GiB but there architectures which allow less
> > than that (not to mention kernel configurations).
> 
> I don't know if Linux makes it possible for userspace applications to
> know the available address space. It can indeed vary depending on the
> kernel config. xz also needs to be portable to many other kernels. The
> 4020 MiB hack works with 64-bit kernels running 32-bit applications
> since in that case many kernels provide 4 GiB of address space.

That is kind of a pain. I'm not aware of anything that reports the
possible address limit other than some test-and-error hacks.
Debian had a 2:2 split and was "forced" to switch to a 3:1 split because
some java applications expected / required a larger virtual address
space. I think that every distro ships a 3:1 32bit kernel now.
You can also have architecture level limitations. If I remember correctly
there was (is) a MIPS achitecture which can not assign more than 2GiB of
address space to a single application.

Anyway. Not to overcompilcate things: On Linux you can obtain the
available system memory which I would cap to 2 or 2.5 GiB by default.
Nobody should be hurt by that.

> There are also resource limits that may also be somewhat OS-specific.
> On GNU/Linux one can use "ulimit -v LIM" where LIM is the virtual
> memory limit in KiB. Trying to exceed it will result in ENOMEM just
> like when running out of address space.
> 
> Trying to figure out the various limits doesn't sound practical
> especially if it is supposed to work with kernels other than Linux.
> Simply trying to allocate a lot of memory (to test if it works) is more
> realistic but I think it's still dumb.

Oh, I though that this isn't Linux only. xz could query it but for
liblzma it is imposible since it is part of something bigger. If the
user is able to set ulimit, it is reasonable to assume that he can also
use -M. While I observed it a few times that a script invoked "xz -T0"
as somepoint which led to bad outcome on big iron.

Sebastian



[xz-devel] [PATCH v2] liblzma: Add multi-threaded decoder

2021-01-05 Thread Sebastian Andrzej Siewior
From: Sebastian Andrzej Siewior 

Changes since v1:
- Use the `timeout' parameter / mythread_cond_timedwait() in the main
  thread.
- A little bit of doxygen documentation.
- Added enum `lzma_memlimit_opt' to lzma_stream_decoder_mt() as an init
  parameter. The idea is to specify how to obey the memory limit so the
  user can keep using one API and not worry to fail due to the memory
  limit. Lets assume the archive has a 9MiB dictionary, 24MiB block of
  uncompressed data. The archive contains two compressed blocks of 10
  MiB each. Using two threads, the memory requirement is roughly
 (9 + 24 + 10) * 2 = 86 MiB

  On a system with 64 MiB of memory with additional 128MiB of swap it
  likely leads to the use of (say 30 MiB) swap memory during
  decompression which will slow down the whole operation.
  The synchronous API would do just fine with only 9 MiB of memory.

  So to not complicate things, invoking lzma_stream_decoder_mt() with a
  memory limit of 32 MiB three scenarios are possible:
  - LZMA_MEMLIMIT_THREAD
One thread requires 43MiB of memory and would exceed the memory
limit. However, continue with one thread instead of possible two.

  - LZMA_MEMLIMIT_NO_THREAD
One thread requires 43MiB of memory and would exceed the memory
limit. Fallback to the synchronous API without buffered input /
output memory.

  - LZMA_MEMLIMIT_COMPLETE
In this scenario it would behave like LZMA_MEMLIMIT_NO_THREAD.
However, with a dictionary size > 32MiB it would abort.

- Add get_free_mem() to get free memory on Linux. It reads the value of
  `MemAvailable' from `/proc/meminfo'. The difference compared to
  `lzma_physmem()' is that it returns the amout of memory that the
  system thinks is usable without using swap memory.
  A big machine may have 128GiB of memory but may also have a few
  applications running consuming 88GiB of it. The `free' command may
  report 38 GiB in page cache and 2 GiB in free. The `MemAvailable' will
  report something between 37 - 39 GiB area.

- New threads are only created if memory limit does not exceed
  filter_size + decomp_block * 2 which accounts the output and
  worst-case input (uncompressed data).
  This is to avoid exceeding the memory limit while extending the input
  buffer.

Changes since RFC:
- Options are considered
- Memory limits are considered. The limit may get exceeded if we get
  close to it and then the existing threads enlarge their in-buffer.
- Blocks with no size information in the header can be decompressed.
  This happens synchronous.

Signed-off-by: Sebastian Andrzej Siewior 
---
 src/liblzma/api/lzma/container.h   |   79 +-
 src/liblzma/common/Makefile.inc|6 +
 src/liblzma/common/stream_decoder_mt.c | 1122 
 src/liblzma/liblzma.map|1 +
 src/xz/coder.c |  110 ++-
 5 files changed, 1313 insertions(+), 5 deletions(-)
 create mode 100644 src/liblzma/common/stream_decoder_mt.c

diff --git a/src/liblzma/api/lzma/container.h b/src/liblzma/api/lzma/container.h
index 9fbf4df06178e..761c358f0362b 100644
--- a/src/liblzma/api/lzma/container.h
+++ b/src/liblzma/api/lzma/container.h
@@ -173,7 +173,7 @@ typedef struct {
uint32_t reserved_int2;
uint32_t reserved_int3;
uint32_t reserved_int4;
-   uint64_t reserved_int5;
+   uint64_t memlimit;
uint64_t reserved_int6;
uint64_t reserved_int7;
uint64_t reserved_int8;
@@ -538,6 +538,83 @@ extern LZMA_API(lzma_ret) lzma_stream_decoder(
lzma_nothrow lzma_attr_warn_unused_result;
 
 
+/**
+ * \brief   The `memory_limit' argument for lzma_stream_decoder_mt()
+ *
+ * The required memory limit for decompression is determined later once the XZ
+ * block header is parsed and the compression parameters are known.
+ * Each thread needs to allocate the memory for the filters (size of the
+ * dictionary) and a buffer for the compressed and decompressed data.
+ * This option specifies how to proceed if is not possible to continue even 
with
+ * one thread.
+ */
+typedef enum {
+
+
+   LZMA_MEMLIMIT_COMPLETE = 0,
+   /**<
+* \brief   Abort decompression.
+*
+* Decompression will fail if the specified memory limit would be
+* exceeded even with one thread.
+*/
+
+
+   LZMA_MEMLIMIT_THREAD = 1,
+   /**<
+* \brief   Continue with one thread.
+*
+* Should the required memory exceed the memory limit then continue with
+* one thread. The input and output is buffered and decompression can
+* happen asynchronous.
+*/
+
+
+   LZMA_MEMLIMIT_NO_THREAD = 2
+   /**<
+* \brief   Continue with without a thread.
+*
+* Should the required memory exceed the memory limit then continue
+* without a thread. The input and output is not buffered and
+* decompression is performed in synchronous fash

[xz-devel] [PATCH v2] xz: Ignore hard link count if not deleting

2020-12-28 Thread Sebastian Andrzej Siewior
From: Sebastian Andrzej Siewior 

xz refuses to decompress a file which has more than one hard link. It
can be reproduced by (as per Vincent):
|$ echo foo > file1
|$ xz file1
|$ ln file1.xz file2.xz
|$ xz -dk file1.xz
|xz: file1.xz: Input file has more than one hard link, skipping

This behaviour is consistent with `gzip' and `bzip2' but it is not
documented. The `--force' option would ignore this restriction.

I traced it back in `gzip' to the 90s but the change was not documented
as why it was needed. It was moved, altered but not documented. At some
point the error was restricted to <= 2 which might be related to disk
quota.

Ignore hard link count on input.

Debian BTS: https://bugs.debian.org/975981
Reported-by: Vincent Lefevre 
Signed-off-by: Sebastian Andrzej Siewior 
---
On 2020-12-27 18:02:44 [+0100], Vincent Lefevre wrote:
> Note that when I reported the bug, I used the -k option (without
> it, I may understand the reason why xz refuses the operation with

adjusted.

 src/xz/file_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/xz/file_io.c b/src/xz/file_io.c
index 0ba8db8fbc4cc..7703e08b75ea8 100644
--- a/src/xz/file_io.c
+++ b/src/xz/file_io.c
@@ -698,7 +698,7 @@ io_open_src_real(file_pair *pair)
goto error;
}
 
-   if (pair->src_st.st_nlink > 1) {
+   if (pair->src_st.st_nlink > 1 && !opt_keep_original) {
message_warning(_("%s: Input file has more "
"than one hard link, "
"skipping"), pair->src_name);
-- 
2.30.0.rc2




[xz-devel] Re: [PATCH] xz: Don't care if input has more than one hard link.

2020-12-27 Thread Sebastian Andrzej Siewior
On 2020-12-27 18:14:51 [+0100], Vincent Lefevre wrote:
> > On 2020-12-27 15:47:05 +0100, Sebastian Andrzej Siewior wrote:
> > > I traced it back in `gzip' to the 90s but the change was not documented
> > > as such. It was moved, altered but not documented. At some point the
> > > error was restricted to <= 2 which might be related to disk quota.
> 
> I think I've found the answer for gzip:
> 
> Sat Jan  21 15:46:38 1993  Jean-loup Gailly  (jl...@chorus.fr)
> 
>   Uncompress files with multiple links only with -f.

Yes, but it never reveals why it was needed / a good idea.

Sebastian



[xz-devel] Re: [PATCH] xz: Don't care if input has more than one hard link.

2020-12-27 Thread Sebastian Andrzej Siewior
On 2020-12-27 18:02:44 [+0100], Vincent Lefevre wrote:
> On 2020-12-27 15:47:05 +0100, Sebastian Andrzej Siewior wrote:
> > xz refuses to decompress a file which has more than one hard link. It
> > can be reproduced byi (as per Vincent):
> > |$ echo foo > file1
> > |$ xz file1
> > |$ ln file1.xz file2.xz
> > |$ xz -d file1.xz
> > |xz: file1.xz: Input file has more than one hard link, skipping
> 
> Note that when I reported the bug, I used the -k option (without
> it, I may understand the reason why xz refuses the operation with
> more than one hard link, because removing the link will lose some
> information: the fact that 2 files were hard-linked).

Indeed, I lost -k option. So without removing the file it actually makes
no sense.
But yes: you lose the hard-link count but I don't see why this is
important.

> > This behaviour is consistent with `gzip' and `bzip2' but it is not
> > documented. The `--force' option would ignore this restriction.
> > 
> > I traced it back in `gzip' to the 90s but the change was not documented
> > as such. It was moved, altered but not documented. At some point the
> > error was restricted to <= 2 which might be related to disk quota.
> 
> I don't see why disk quota would have an influence: whatever
> the number of link counts, a new file (the decompressed one)
> is created.

My understanding is that the first file is accounted on users' quota and
the hardlink (by another user) isn't. If the user, that had the file
accounted, removes the file then it gets accounted on the disk quota of
the other user which may exceed the his quota.
I never used disk quota, this a story I remember from back then.

Sebastian



[xz-devel] [PATCH] xz: Don't care if input has more than one hard link.

2020-12-27 Thread Sebastian Andrzej Siewior
From: Sebastian Andrzej Siewior 

xz refuses to decompress a file which has more than one hard link. It
can be reproduced byi (as per Vincent):
|$ echo foo > file1
|$ xz file1
|$ ln file1.xz file2.xz
|$ xz -d file1.xz
|xz: file1.xz: Input file has more than one hard link, skipping

This behaviour is consistent with `gzip' and `bzip2' but it is not
documented. The `--force' option would ignore this restriction.

I traced it back in `gzip' to the 90s but the change was not documented
as such. It was moved, altered but not documented. At some point the
error was restricted to <= 2 which might be related to disk quota.

Ignore hard link count on input.

Debian BTS: https://bugs.debian.org/975981
Reported-by: Vincent Lefevre 
Signed-off-by: Sebastian Andrzej Siewior 
---
 src/xz/file_io.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/src/xz/file_io.c b/src/xz/file_io.c
index 0ba8db8fbc4cc..7fa17431c5dbc 100644
--- a/src/xz/file_io.c
+++ b/src/xz/file_io.c
@@ -697,13 +697,6 @@ io_open_src_real(file_pair *pair)
pair->src_name);
goto error;
}
-
-   if (pair->src_st.st_nlink > 1) {
-   message_warning(_("%s: Input file has more "
-   "than one hard link, "
-   "skipping"), pair->src_name);
-   goto error;
-   }
}
 
// If it is something else than a regular file, wait until
-- 
2.30.0.rc2




Re: [xz-devel] [PATCH] xz: Fix setting memory limit on 32-bit systems

2020-12-26 Thread Sebastian Andrzej Siewior
On 2020-12-26 09:33:04 [+0300], Vitaly Chikunov wrote:
> This wasn't working, because `memlimit_compress` initialized with zero,
> thus memory limit is never lowered for 32-bit address space, causing
> `Cannot allocate memory' error (in `lzma_outq_init()'). For example,
> when `-T0' is used on 32 CPUs with compression level higher than `-6'.

That is one way. It might be that hardware_init() should pass
`total_ram' to hardware_memlimit_set() instead of 0.
hardware_memlimit_get() treats 0 as unlimited but I don't think it makes
sense since memory is never unlimited.
Also, 32bit with almost 4GiB as a limit is working. If you increase your
input (the example from your previous email) then you also end up "can
not allocate memory) simply because 32bit can not allocate 4GiB of
memory. I'm not sure if the actual memory limit is exported. It is
usually at around 3GiB but there architectures which allow less than
that (not to mention kernel configurations).

> Signed-off-by: Vitaly Chikunov 

Sebastian



[xz-devel] [PATCH] xzdiff: Trap SIGPIPE

2020-12-24 Thread Sebastian Andrzej Siewior
From: Sebastian Andrzej Siewior 

The `cmp' command will return early if a difference is found while the
shell script is still invoking the decompressor which writes into the
closed FD. This results in SIGPIPE / exit code 141.
By ignoring SIGPIPE the real return code from `cmp' is observed which is
`1' and xzdiff exits with `1'. Without ignoring SIGPIPE the exitcode 141
is observed and xzdiff returns with `2'.

Reported to Debian BTS as #844770. Change suggested by Étienne Mollierö.

BTS: https://bugs.debian.org/844770

Signed-off-by: Sebastian Andrzej Siewior 
---
 src/scripts/xzdiff.in | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/scripts/xzdiff.in b/src/scripts/xzdiff.in
index eb7825c1dfba9..d343a889e0e52 100644
--- a/src/scripts/xzdiff.in
+++ b/src/scripts/xzdiff.in
@@ -1,4 +1,5 @@
 #!@POSIX_SHELL@
+trap '' PIPE
 
 # Copyright (C) 1998, 2002, 2006, 2007 Free Software Foundation
 # Copyright (C) 1993 Jean-loup Gailly
-- 
2.29.2




Re: [xz-devel] [RFC PATCH] liblzma: Test memory availability for lzma_stream_encoder_mt

2020-12-24 Thread Sebastian Andrzej Siewior
On 2020-12-24 00:51:58 [+0300], Vitaly Chikunov wrote:
> Add guard call to allocate the memory before calling get_thread to
> prevent memory failures (LZMA_MEM_ERROR).
> 
> This would make simple `xz -T0' more robust on 32-bit architectures.
> 
> Rationale: simple `-T0' is hard to use portably in scripts on different
> platforms, because there is always different amount of RAM and CPU,
> causing unexpected "xz: Cannot allocate memory" errors.

Could you please say how much CPUs, memory you have and what command
line you have used?

> Signed-off-by: Vitaly Chikunov 

Sebastian



Re: [xz-devel] [RFC WIP] liblzma: Add multi-threaded decoder

2020-12-23 Thread Sebastian Andrzej Siewior
On 2020-12-17 00:12:06 [+0200], Lasse Collin wrote:
> When decoding one can get smoother output by copying decompressed data
> out in smaller chunks. Your code does this but it's still a single
> mutex for all threads. With many threads that is easily thousands or
> even tens of thousands of locks/unlocks per second from all threads
> combined. I don't know how much it matters and if it is worth it to
> make it more complex. For example, one could have a thread-specific flag
> indicating if the main thread is interested in the data from that
> thread. Then only that thread would lock/signal/unlock the main mutex
> when a chunk of data is ready.

I moved parts of the memcpy() out the locked section. Only the thread,
that is currently decompressing is waking the main thread. However the
current output position is updated under the main-thread's mutex. So
that might be not optimal.

Let me know if you have other concerns or of this part should be
reworked somehow.

Sebastian



[xz-devel] [PATCH] liblzma: Add multi-threaded decoder

2020-12-23 Thread Sebastian Andrzej Siewior
From: Sebastian Andrzej Siewior 

Changes since last post:
- Options are considered
- Memory limits are considered. The limit may get exceeded if we get
  close to it and then the existing threads enlarge their in-buffer.
- Blocks with no size information in the header can be decompressed.
  This happens synchronous.

Signed-off-by: Sebastian Andrzej Siewior 
---
 src/liblzma/api/lzma/container.h   |5 +-
 src/liblzma/common/Makefile.inc|6 +
 src/liblzma/common/stream_decoder_mt.c | 1058 
 src/liblzma/liblzma.map|1 +
 src/xz/coder.c |   15 +-
 5 files changed, 1080 insertions(+), 5 deletions(-)
 create mode 100644 src/liblzma/common/stream_decoder_mt.c

diff --git a/src/liblzma/api/lzma/container.h b/src/liblzma/api/lzma/container.h
index 9fbf4df06178e..de0a77b5d6482 100644
--- a/src/liblzma/api/lzma/container.h
+++ b/src/liblzma/api/lzma/container.h
@@ -173,7 +173,7 @@ typedef struct {
uint32_t reserved_int2;
uint32_t reserved_int3;
uint32_t reserved_int4;
-   uint64_t reserved_int5;
+   uint64_t memlimit;
uint64_t reserved_int6;
uint64_t reserved_int7;
uint64_t reserved_int8;
@@ -630,3 +630,6 @@ extern LZMA_API(lzma_ret) lzma_stream_buffer_decode(
const uint8_t *in, size_t *in_pos, size_t in_size,
uint8_t *out, size_t *out_pos, size_t out_size)
lzma_nothrow lzma_attr_warn_unused_result;
+
+extern LZMA_API(lzma_ret)
+   lzma_stream_decoder_mt(lzma_stream *strm, const lzma_mt *options);
diff --git a/src/liblzma/common/Makefile.inc b/src/liblzma/common/Makefile.inc
index 0408f9a48c4db..3c140c2955475 100644
--- a/src/liblzma/common/Makefile.inc
+++ b/src/liblzma/common/Makefile.inc
@@ -78,4 +78,10 @@ liblzma_la_SOURCES += \
common/stream_decoder.h \
common/stream_flags_decoder.c \
common/vli_decoder.c
+
+if COND_THREADS
+liblzma_la_SOURCES += \
+   common/stream_decoder_mt.c
+endif
+
 endif
diff --git a/src/liblzma/common/stream_decoder_mt.c 
b/src/liblzma/common/stream_decoder_mt.c
new file mode 100644
index 0..b2f1c9ebfa607
--- /dev/null
+++ b/src/liblzma/common/stream_decoder_mt.c
@@ -0,0 +1,1058 @@
+///
+//
+/// \file   stream_decoder_mt.c
+/// \brief  Multithreaded .xz Stream decoder
+//
+//  Author: Sebastian Andrzej Siewior
+//
+//  This file has been put into the public domain.
+//  You can do whatever you want with this file.
+//
+///
+
+#include "common.h"
+#include "block_decoder.h"
+#include "outqueue.h"
+#include "stream_decoder.h"
+#include "index.h"
+
+#include 
+
+typedef enum {
+   /// Waiting for work.
+   THR_IDLE,
+
+   /// Decoding is in progress.
+   THR_RUN,
+
+   /// The main thread wants the thread to stop whatever it was doing
+   /// but not exit.
+   THR_STOP,
+
+   /// The main thread wants the thread to exit.
+   THR_EXIT,
+
+} worker_state;
+
+struct out_buffer {
+   uint8_t *out;
+   size_t out_block_size;  /* Size of ->out*/
+   size_t out_pos; /* Bytes written to ->out (worker)  */
+   size_t out_filled;  /* Bytes consumed of ->out (coordinator) */
+};
+
+struct worker_thread {
+   worker_state state;
+
+   uint8_t *in;
+   size_t in_size; /* Size of ->in */
+   size_t in_block_size;   /* Size of current block*/
+   size_t in_filled;   /* Bytes written to ->in (coordinator)  */
+   size_t in_pos;  /* Bytes consumed of ->in (worker)  */
+
+   struct out_buffer out;
+
+   /// Pointer to the main structure is needed when putting this
+   /// thread back to the stack of free threads.
+   struct lzma_stream_coder *coder;
+
+   /* The allocator is set by the main thread. */
+   const lzma_allocator *allocator;
+   /* Filter size is used for memusage accounting */
+   size_t filter_size;
+
+   lzma_next_coder block_decoder;
+   lzma_block block_options;
+   struct worker_thread *next;
+
+   mythread_mutex mutex;
+   mythread_cond cond;
+   lzma_ret thread_error;
+
+   mythread thread_id;
+};
+
+struct lzma_stream_coder {
+   enum {
+   SEQ_STREAM_HEADER,
+   SEQ_BLOCK_HEADER,
+   SEQ_BLOCK,
+   SEQ_INDEX,
+   SEQ_STREAM_FOOTER,
+   SEQ_STREAM_PADDING,
+   } sequence;
+
+
+   /// Memory usage limit
+   uint64_t memlimit;
+   /// Amount of memory actually needed (only an estimate)
+   uint64_t memusage;
+   size_t exp_filter_size;
+   size_t exp_block_size;
+
+   lzma_index_hash *in

Re: [xz-devel] [RFC WIP] liblzma: Add multi-threaded decoder

2020-12-14 Thread Sebastian Andrzej Siewior
On 2020-12-13 23:19:25 [+0200], Lasse Collin wrote:
> > Threads, which finished decoding, remain idle until their output
> > buffer has been fully consumed. The output buffer once allocated
> > remains allocated until the thread is cleaned up. This saved 5 secs
> > in the example above compared to freeing the buffer once the buffer
> > was fully consumed and allocating it again once there is new data.
> > The input buffer is freshly allocated for each block since they vary
> > in size in general.
> 
> Yes, reusing buffers and encoder/decoder states can be useful (fewer
> page faults). Perhaps even the input buffer could be reused if it is OK
> to waste some memory and it makes a difference in speed.

I tried two archives with 16 & 3 CPUs and the time remained the same. I
tried to only increase the in-buffer and to allocate the block-size also
for the in-buffer. No change.

> > I made my own output queue since the output size is known. I have no
> > idea if this is good or if it would be better to use lzma_outq
> > instead.
> 
> The current lzma_outq isn't flexible enough for a decoder. It's a bit
> primitive even for encoding: it works fine but it wastes a little
> memory. However, since the LZMA encoder needs a lot of memory anyway,
> the overall difference is around (or under) 10 % which likely doesn't
> matter too much.
> 
> The idea of lzma_outq is to have a pool for output buffers that is
> separate from the pool of worker threads. Different data takes
> different amount of time to compress. The separate pools allow Blocks
> to finish out of order and reusing worker threads immediately as long
> as there is enough extra buffer space in the output queue. This is an
> important detail for encoder performance (to prevent idle threads) and
> with a quick try it seems it might help with decoding too. The
> significance depends a lot on the data, of course.

I tried to decouple the thread and the out-buffer but after several failures I
tried to keep it simple for the start.
I do have idle threads after a while with 16 CPUs. The alternative is to keep
them busy with more memory.  With 4 CPUs I get to
|  100 % 10,2 GiB / 40,0 GiB = 0,255   396 MiB/s   1:43 

and this is ofcourse a CPU bound due to the `sha1' part as consumer. I
don't know how reasonable this performace is if it means that you have
to write 400MiB/s to disk. Of course, should it become an issue then it
can still be decoupled.

Sebastian



[xz-devel] [RFC WIP] liblzma: Add multi-threaded decoder

2020-12-12 Thread Sebastian Andrzej Siewior
From: Sebastian Andrzej Siewior 

This is WIP, the decoder appears to work based on:

|$ xz -dv < buster-pl.xz | openssl sha1
|  100 % 10,2 GiB / 40,0 GiB = 0,255   114 MiB/s   6:00
|(stdin)= 5eb4e2a3ce2253a6ec3fc86ee7ad8db0a5395959
|
|vs
|
|$ ./src/xz/.libs/xz -dv < buster-pl.xz | openssl sha1
|  100 % 10,2 GiB / 40,0 GiB = 0,255   815 MiB/s   0:50
|  (stdin)= 5eb4e2a3ce2253a6ec3fc86ee7ad8db0a5395959

Options of any kind are hardcoded (it is WIP after all).
For successful decoding the block header needs to contain the
compressed/ uncompressed size (the "cu" in the Flags column of xz -lvv).
Decoding of blocks without sizes is also on the todo list so the user
can use one interface.

Threads, which finished decoding, remain idle until their output buffer
has been fully consumed. The output buffer once allocated remains
allocated until the thread is cleaned up. This saved 5 secs in the
example above compared to freeing the buffer once the buffer was fully
consumed and allocating it again once there is new data.
The input buffer is freshly allocated for each block since they vary in
size in general.

Parts of the mt-decoder are copied from the other decoder. Not sure if
this is good or should be merged somehow with the single threaded
decoder.
I made my own output queue since the output size is known. I have no
idea if this is good or if it would be better to use lzma_outq instead.

Signed-off-by: Sebastian Andrzej Siewior 
---
 src/liblzma/Makefile.am|   2 +-
 src/liblzma/api/lzma/container.h   |   5 +
 src/liblzma/common/Makefile.inc|   6 +
 src/liblzma/common/stream_decoder_mt.c | 976 +
 src/liblzma/liblzma.map|   3 +-
 src/xz/coder.c |   4 +
 6 files changed, 994 insertions(+), 2 deletions(-)
 create mode 100644 src/liblzma/common/stream_decoder_mt.c

diff --git a/src/liblzma/Makefile.am b/src/liblzma/Makefile.am
index 6323e26aade10..1f2445d94a33c 100644
--- a/src/liblzma/Makefile.am
+++ b/src/liblzma/Makefile.am
@@ -13,7 +13,7 @@ doc_DATA =
 
 lib_LTLIBRARIES = liblzma.la
 liblzma_la_SOURCES =
-liblzma_la_CPPFLAGS = \
+liblzma_la_CPPFLAGS = -Wno-unused-parameter \
-I$(top_srcdir)/src/liblzma/api \
-I$(top_srcdir)/src/liblzma/common \
-I$(top_srcdir)/src/liblzma/check \
diff --git a/src/liblzma/api/lzma/container.h b/src/liblzma/api/lzma/container.h
index 9fbf4df06178e..bd8e410215bdd 100644
--- a/src/liblzma/api/lzma/container.h
+++ b/src/liblzma/api/lzma/container.h
@@ -630,3 +630,8 @@ extern LZMA_API(lzma_ret) lzma_stream_buffer_decode(
const uint8_t *in, size_t *in_pos, size_t in_size,
uint8_t *out, size_t *out_pos, size_t out_size)
lzma_nothrow lzma_attr_warn_unused_result;
+
+extern LZMA_API(uint64_t)
+   lzma_stream_decoder_mt_memusage(const lzma_mt *options);
+extern LZMA_API(lzma_ret)
+   lzma_stream_decoder_mt(lzma_stream *strm, const lzma_mt *options);
diff --git a/src/liblzma/common/Makefile.inc b/src/liblzma/common/Makefile.inc
index 0408f9a48c4db..3c140c2955475 100644
--- a/src/liblzma/common/Makefile.inc
+++ b/src/liblzma/common/Makefile.inc
@@ -78,4 +78,10 @@ liblzma_la_SOURCES += \
common/stream_decoder.h \
common/stream_flags_decoder.c \
common/vli_decoder.c
+
+if COND_THREADS
+liblzma_la_SOURCES += \
+   common/stream_decoder_mt.c
+endif
+
 endif
diff --git a/src/liblzma/common/stream_decoder_mt.c 
b/src/liblzma/common/stream_decoder_mt.c
new file mode 100644
index 0..0101ab4339d59
--- /dev/null
+++ b/src/liblzma/common/stream_decoder_mt.c
@@ -0,0 +1,976 @@
+///
+//
+/// \file   stream_decoder_mt.c
+/// \brief  Multithreaded .xz Stream decoder
+//
+//  Author: Sebastian Andrzej Siewior
+//
+//  This file has been put into the public domain.
+//  You can do whatever you want with this file.
+//
+///
+
+#include "common.h"
+#include "block_decoder.h"
+#include "outqueue.h"
+#include "stream_decoder.h"
+#include "index.h"
+
+#include 
+
+typedef enum {
+   /// Waiting for work.
+   THR_IDLE,
+
+   /// Decoding is in progress.
+   THR_RUN,
+
+   /// The main thread wants the thread to stop whatever it was doing
+   /// but not exit.
+   THR_STOP,
+
+   /// The main thread wants the thread to exit. We could use
+   /// cancellation but since there's stopped anyway, this is lazier.
+   THR_EXIT,
+
+} worker_state;
+
+struct out_buffer {
+   uint8_t *out;
+   size_t out_block_size;  // Size of ->out
+   size_t out_pos; // Bytes written to ->out (worker)
+   size_t out_filled;  // Bytes consumed of ->out (coordinator)
+};
+
+struct worker_thread {
+