Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-03-31 Thread Lasse Collin
On 2022-03-17 Jia Tan wrote:
> I attached two patches to this message. The first should fix a bug
> with the timeouts.

Thanks! This and the deadlock are now fixed (I committed them a few days
ago).

> The second patch is for the memlimit_threading update. I added a new
> API function that will fail for anything that is not the multithreaded
> decoder.

I need to consider this a little later. Some of the things I will do
next (some already have a patch on this list):

  - Add fail-fast flag to lzma_stream_decoder_mt().

  - Possibly fix a corner case in threaded coder if lzma_code() is
called in a similar way as in zpipe.c in in
. That is, currently it doesn't
work but it can be made to work, I think. Supporting it makes
threaded decoder a little easier to adapt to existing apps if they
use that kind of decoding loop.

  - --memlmit-threading, I wrote this weeks ago except a few details
that need to be decided. For example, I guess -M should set
--memlimit-threading just like it sets --memlimit-compress and
--memlimit-decompress.

  - Initial version of automatic memlimit with --threads=0. First
version can be based on lzma_physmem() but other methods can be
added. Sebastian's patch uses MemAvailable on Linux, your patch
uses freemem from sysinfo() which equals MemFree in /proc/meminfo.
I suppose MemAvailable is a better starting point.

  - Support for forcing single/multi-threaded mode with --threads for
cases when xz decides to use only one thread.

  - Fix changing memlimit after LZMA_MEMLIMIT_ERROR in the old
single-threaded decoder. (I knew it's a rare use case but clearly
it's not a use case at all since I haven't seen bug reports.)

  - Your test framework patches

I suppose then the next alpha release is close to ready.

-- 
Lasse Collin



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-03-17 Thread Jia Tan
> > As promised, I have attached a patch to solve the problem. Instead of
> > doing as I had originally proposed, I simply added a wake up signal
> > to a sleeping thread if partial updates are enabled. When the worker
> > wakes up, it checks if no more input
> > is available and signals to the main thread if it has output ready
> > before going back
> > to sleep. This prevents the deadlock on my liblzma tests and testing
> > xz with/without timeout.
>
> Thanks to both of you for debugging this. I see now that I had
> completely missed this corner case. The patch looks correct except that
> the mutex locking order is wrong which can cause a new deadlock. If
> both thr->mutex and coder->mutex are locked at the same time,
> coder->mutex must be locked first.

Good catch on the deadlock I created by the order of the mutex locking.
I had forgotten that the cond_wait locks the mutex again after the
sleep is done.
Would you like me to fix this and submit another patch?

> About memlimit updates, that may indeed need some work but I don't know
> yet how much is worth the trouble. stream_decoder_mt_memconfig() has a
> few FIXMEs too, maybe they don't need to be changed but it needs to be
> decided.

I attached two patches to this message. The first should fix a bug
with the timeouts.
When a timeout is set and the main thread times out, ret = LZMA_TIMED_OUT.
The check at the end of the function checks if ret == LZMA_OK, and if
not it will
call threads_stop. If it stops a thread before it can mark the outbuf
as finished, the
main thread will always wait on that thread since its buffer is not finished.

The second patch is for the memlimit_threading update. I added a new API
function that will fail for anything that is not the multithreaded
decoder. I tested it
locally and it successfully resumed multithreading after adjusting the
function was
called.

>I'm in a hurry now but I should have time for xz next week. :-)

Let me know your thoughts on these patches when you have a chance :)

Jia Tan
From be3695ec59b62eb1c0872005ef58d54211b99cf1 Mon Sep 17 00:00:00 2001
From: jiat75 
Date: Thu, 17 Mar 2022 00:55:14 +0800
Subject: [PATCH] Fixes bug with timeout in multithreaded decoder
 The LZMA_TIMED_OUT was causing threads_stop to stop worker threads that were in
 progress, so they were not able to mark their output_buf to finished. This
 caused the main thread to believe the worker thread was still working on the
 block, even though it had already been stopped.

---
 src/liblzma/common/stream_decoder_mt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/liblzma/common/stream_decoder_mt.c b/src/liblzma/common/stream_decoder_mt.c
index 1665e22b..1828378c 100644
--- a/src/liblzma/common/stream_decoder_mt.c
+++ b/src/liblzma/common/stream_decoder_mt.c
@@ -817,7 +817,7 @@ read_output_and_wait(struct lzma_stream_coder *coder,
 	// If we are returning an error, then the application cannot get
 	// more output from us and thus keeping the threads running is
 	// useless and waste of CPU time.
-	if (ret != LZMA_OK)
+	if (ret != LZMA_OK && ret != LZMA_TIMED_OUT)
 		threads_stop(coder);
 
 	return ret;
-- 
2.32.0

From 8daba6345e711bfd1443b58b8566705b7670ac67 Mon Sep 17 00:00:00 2001
From: jiat75 
Date: Thu, 17 Mar 2022 18:07:50 +0800
Subject: [PATCH] Created lzma_memlimit_threading_set
 Can now change memlimit_threading in liblzma after
 initialization of the stream

---
 src/liblzma/api/lzma/base.h| 17 +
 src/liblzma/common/common.c| 23 +++
 src/liblzma/common/common.h|  8 
 src/liblzma/common/stream_decoder_mt.c | 15 +++
 src/liblzma/liblzma.map|  1 +
 5 files changed, 64 insertions(+)

diff --git a/src/liblzma/api/lzma/base.h b/src/liblzma/api/lzma/base.h
index 749e4fe7..0de0f309 100644
--- a/src/liblzma/api/lzma/base.h
+++ b/src/liblzma/api/lzma/base.h
@@ -699,3 +699,20 @@ extern LZMA_API(uint64_t) lzma_memlimit_get(const lzma_stream *strm)
  */
 extern LZMA_API(lzma_ret) lzma_memlimit_set(
 		lzma_stream *strm, uint64_t memlimit) lzma_nothrow;
+
+/**
+ * @brief
+ * This function is only supported by lzma_decoder_mt. This function allows
+ * changing the memlimit_threading value after initialization to adjust
+ * the memory limit before single threaded mode is used
+ *
+ * \param   strm  Pointer to lzma_stream that is at least
+ *initialized with LZMA_STREAM_INIT.
+ * \param   memlimit  New value for memlimit_threading
+ *
+ * \return  -LZMA_OK: New memlimit_threading value successfully set
+ *  -LZMA_PROG_ERROR: Invalid arguments, e.g. *strm doesn't
+ *   support memlimit_threading or was not properly initialized
+ */
+extern LZMA_API(lzma_ret) lzma_memlimit_threading_set(
+		lzma_stream *strm, uint64_t new_memlimit) lzma_nothrow;
diff --git a/src/liblzma/common/common.c b/src/liblzma/common/common.c
index 346fc7af..1755e617 100644

Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-03-17 Thread Lasse Collin
On 2022-03-15 Jia Tan wrote:
> As promised, I have attached a patch to solve the problem. Instead of
> doing as I had originally proposed, I simply added a wake up signal
> to a sleeping thread if partial updates are enabled. When the worker
> wakes up, it checks if no more input
> is available and signals to the main thread if it has output ready
> before going back
> to sleep. This prevents the deadlock on my liblzma tests and testing
> xz with/without timeout.

Thanks to both of you for debugging this. I see now that I had
completely missed this corner case. The patch looks correct except that
the mutex locking order is wrong which can cause a new deadlock. If
both thr->mutex and coder->mutex are locked at the same time,
coder->mutex must be locked first.

About memlimit updates, that may indeed need some work but I don't know
yet how much is worth the trouble. stream_decoder_mt_memconfig() has a
few FIXMEs too, maybe they don't need to be changed but it needs to be
decided.

I'm in a hurry now but I should have time for xz next week. :-)

-- 
Lasse Collin



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-03-17 Thread Lasse Collin
Hello!

Once again, sorry for the delay. I will be busy the rest of the week. I
will get back to xz early next week.

On 2022-03-07 Sebastian Andrzej Siewior wrote:
> 32 cores:
> 
> | $ time ./src/xz/xz -tv tars.tar.xz -T0
> | tars.tar.xz (1/1)
> |   100 %  2.276,2 MiB / 18,2 GiB = 0,122   1,6 GiB/s   0:11
> | 
> | real0m11,162s
> | user5m44,108s
> | sys 0m1,988s
> 
> 256 cores:
> | $ time ./src/xz/xz -tv tars.tar.xz -T0
> | tars.tar.xz (1/1)
> |   100 %  2.276,2 MiB / 18,2 GiB = 0,122   3,4 GiB/s   0:05
> | 
> | real0m5,403s
> | user4m0,298s
> | sys 0m24,315s
> 
> it appears to work :) If I see this right, then the file is too small
> or xz too fast but it does not appear that xz manages to create more
> than 100 threads.

Thanks! The scaling is definitely good enough. :-) Even if there was
room for improvement I won't think about it much for now.

A curious thing above is the ratio of user-to-sys time. With more
threads a lot more is spent in syscalls.

> and decompression to disk
> | $ time ~bigeasy/xz/src/xz/xz -dvk tars.tar.xz -T0
> | tars.tar.xz (1/1)
> |   100 %  2.276,2 MiB / 18,2 GiB = 0,122   746 MiB/s   0:24
> | 
> | real0m25,064s
> | user3m49,175s
> | sys 0m29,748s
> 
> appears to block at around 10 to 14 threads or so and then it hangs
> at the end until disk I/O finishes. Decent.
> Assuming disk I/O is slow, say 10MiB/s, and we would 388 CPUs
> (blocks/2) then it would decompress the whole file into memory and
> stuck on disk I/O?

Yes.

I wonder if the way xz does I/O might affect performance. Every time
the 8192-byte input buffer is empty (that is, liblzma has consumed it),
xz will block reading more input until another 8192 bytes have been
read. As long as threads can consume more input, each call to
lzma_code() will use all 8192 bytes. Each call might pass up to 8192
bytes of output from liblzma to xz too. If compression ratio is high
and reading input isn't very fast, then perhaps performance might go
down because blocking on input prevents xz from producing more output.
Only when liblzma cannot consume more input xz will produce output at
full speed.

That is, I wonder if with slow input the output speed will be limited
until the input buffers inside liblzma have been filled. My explanation
isn't very good, sorry.

Ideally input and output would be in different threads but the liblzma
API doesn't really allow that. Based on your benchmarks the current
method likely is easily good enough in practice.

> In terms of scaling, xz -tv of that same file with with -T1…64:
[...]
> time of 1 CPU / 64 = (3 * 60 + 38) / 64 = 3.40625
> 
> Looks okay.

Yes, thanks!

> > If the input is broken, it should produce as much output as the
> > single-threaded stable version does. That is, if one thread detects
> > an error, the data before that point is first flushed out before
> > the error is reported. This has pros and cons. It would be easy to
> > add a flag to allow switching to fast error reporting for
> > applications that don't care about partial output from broken
> > files.  
> 
> I guess most of them don't care because an error is usually an abort,
> the sooner, the better. It is probably the exception that you want
> decompress it despite the error and maybe go on with the next block
> and see what is left.

I agree. Over 99 % of the time any error means that the whole output
will be discarded. However, I would like to make the threaded decoder
to (optionally) have very similar external behavior as the
single-threaded version for cases where it might matter. It's not
perfect at the moment but I think it's decent enough (bugs excluded).

Truncated files are a special case of corrupt input because, unless
LZMA_FINISH is used, liblzma cannot know if the input is truncated or
if there is merely a pause in the input for some application-specific
reason. That can result in LZMA_BUF_ERROR but if the application knows
that such pauses are possible then it can handle LZMA_BUF_ERROR
specially and continue decoding when more input is available.

-- 
Lasse Collin



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-03-15 Thread Jia Tan
> I tested this locally by setting the timeout for xz to 0 on line 57 of
> src/xz/coder.c and it deadlocks on a truncated xz file. It seems like a
> race condition because it deadlocks ~90% of the time in my liblzma tests.
> When debugging, I can see that the main thread does not return from
> lzma_code when the LZMA_FINISH action is given. When the decoder receives
> LZMA_FINISH, it should probably signal to the worker threads that no more
> input is coming and to exit with error if they are in the middle of a
> block. Even if LZMA_FINISH is not given, we still need to avoid deadlock at
> all costs. I will come up with a patch for this since it sounds like
> a fun problem :)

As promised, I have attached a patch to solve the problem. Instead of doing as I
had originally proposed, I simply added a wake up signal to a sleeping thread if
partial updates are enabled. When the worker wakes up, it checks if no
more input
is available and signals to the main thread if it has output ready
before going back
to sleep. This prevents the deadlock on my liblzma tests and testing xz
with/without timeout.

Jia Tan
From 5f020915b4e9b60759515df61b9812c1d45c8995 Mon Sep 17 00:00:00 2001
From: jiat75 
Date: Tue, 15 Mar 2022 20:25:50 +0800
Subject: [PATCH] Deadlock prevention with partial updates in worker thread
 These changes prevent deadlock in the mt decoder. The issue
 with truncated xz files was the main threaded had a race condition with the
 last worker thread. The main thread was telling the worker thread to do
 partial updates after the worker thread had already begun sleeping until more
 input arrived. The main thread then went to sleep until the worker thread
 finished its input. This resulted in deadlock since both threads are sleeping
 and waiting on the other to wake it

---
 src/liblzma/common/stream_decoder_mt.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/src/liblzma/common/stream_decoder_mt.c b/src/liblzma/common/stream_decoder_mt.c
index 1f1cd771..11d4cce7 100644
--- a/src/liblzma/common/stream_decoder_mt.c
+++ b/src/liblzma/common/stream_decoder_mt.c
@@ -304,6 +304,9 @@ worker_enable_partial_update(void *thr_ptr)
 
 	mythread_sync(thr->mutex) {
 		thr->partial_update = true;
+		// Signal to worker thread to wake it up
+		// in case it has a partial update ready
+		mythread_cond_signal(>cond);
 	}
 }
 
@@ -375,6 +378,19 @@ next_loop_unlocked:
 
 	if (in_filled == thr->in_pos) {
 		mythread_cond_wait(>cond, >mutex);
+		// If thr->partial_update is true and we have no new update,
+		// tell the main thread the progress made to avoid a
+		// race condition with the main thread setting partial
+		// update and this thread sleeping until more input
+		// arrives. This is only necessary if there is a truncated
+		// file
+		if (thr->partial_update && in_filled == thr->in_pos) {
+			mythread_sync(thr->coder->mutex) {
+thr->outbuf->pos = thr->out_pos;
+thr->outbuf->decoder_in_pos = thr->in_pos;
+mythread_cond_signal(>coder->cond);
+			}
+		}
 		goto next_loop_unlocked;
 	}
 
-- 
2.25.1



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-03-14 Thread Jia Tan
> > > This:
> > >
> > > diff --git a/src/liblzma/common/stream_decoder_mt.c 
> > > b/src/liblzma/common/stream_decoder_mt.c
> > > --- a/src/liblzma/common/stream_decoder_mt.c
> > > +++ b/src/liblzma/common/stream_decoder_mt.c
> > > @@ -786,7 +786,7 @@ read_output_and_wait(struct lzma_stream_coder *coder,
> > > if (mythread_cond_timedwait(>cond,
> > > >mutex,
> > > wait_abs) != 0) {
> > > -   ret = LZMA_TIMED_OUT;
> > > +   ret = LZMA_OK;
> > > break;
> > > }
> > > } else {
> > >
> > > Should "fixes" it. At some point the main thread needs to check if the
> > > next thread is able to make progress or not and then return LZMA_OK so
> > > that the upper layer can figure out that no progress is made. Otherwise
> > > it stucks in the LZMA_TIMED_OUT loop.
> >
> > This fixes it just for xz, but if no timeout is specified it will
> > still deadlock.
> > I haven't looked at the code enough to understand how to fix it for both,
> > but I will start to look into that.
>
> No, it is for everyone. In the timeout case we need to check if the
> first thread in line can make progress. If we don't provide data new
> data to the thread and the thread consumed everything it had then the
> thread won't make progress. If the function gets invoked with 0 new data
> then we should return LZMA_OK at which point the upper layer (between XZ
> binary and the library) will notice that no progress is made and return
> an error to xz.

I tested this locally by setting the timeout for xz to 0 on line 57 of
src/xz/coder.c and it deadlocks on a truncated xz file. It seems like a
race condition because it deadlocks ~90% of the time in my liblzma tests.
When debugging, I can see that the main thread does not return from
lzma_code when the LZMA_FINISH action is given. When the decoder receives
LZMA_FINISH, it should probably signal to the worker threads that no more
input is coming and to exit with error if they are in the middle of a
block. Even if LZMA_FINISH is not given, we still need to avoid deadlock at
all costs. I will come up with a patch for this since it sounds like
a fun problem :)

> The above patch is not correct because if you do it as I suggeted then
> it is possible that an error is returned because the thread is slow and
> did not yet make progress.

I agree that the above patch only "fixes" the problem. Small timeouts and
slow threads will result in false positive deadlock detections.

> > I was following the conversation about the soft and hard memory limiting.
> > If a user wanted decoding to fail if it can't be done multithreaded and 
> > update
> > the memory limit as needed, that can't be done right now. It's a minor issue
> > that only matters for liblzma, but it would be nice to be able to update 
> > both
> > limits after decoding has started. I don't consider this a bug, more like
> > missing a nice to have feature. One easy solution is to add a new API
> > function to liblzma to update the soft memory limit for the multithreaded
> > decoder and do nothing / return an error on all other coders. I will add
> > a patch for this if you guys think it is a good idea.
>
> Maybe I missundertand you. But if you set memlimit_stop to the same
> value as memlimit_threading then you have either multi threaded
> decompression or none at all right?

I understand that setting memlimit_stop = memlimit_threading will result in
either threaded decoding or no decoding. The problem is only the
memlimit_stop can be updated. There is no way to update memlimit_threading
after initialization to ensure the "give me threaded decoding or none
at all" is still true. I don't think most users will care about
this, but it would be nice to provide the flexibility for those that do.

Jia Tan



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-03-11 Thread Sebastian Andrzej Siewior
On 2022-03-11 21:59:19 [+0800], Jia Tan wrote:
> > This:
> >
> > diff --git a/src/liblzma/common/stream_decoder_mt.c 
> > b/src/liblzma/common/stream_decoder_mt.c
> > --- a/src/liblzma/common/stream_decoder_mt.c
> > +++ b/src/liblzma/common/stream_decoder_mt.c
> > @@ -786,7 +786,7 @@ read_output_and_wait(struct lzma_stream_coder *coder,
> > if (mythread_cond_timedwait(>cond,
> > >mutex,
> > wait_abs) != 0) {
> > -   ret = LZMA_TIMED_OUT;
> > +   ret = LZMA_OK;
> > break;
> > }
> > } else {
> >
> > Should "fixes" it. At some point the main thread needs to check if the
> > next thread is able to make progress or not and then return LZMA_OK so
> > that the upper layer can figure out that no progress is made. Otherwise
> > it stucks in the LZMA_TIMED_OUT loop.
> 
> This fixes it just for xz, but if no timeout is specified it will
> still deadlock.
> I haven't looked at the code enough to understand how to fix it for both,
> but I will start to look into that.

No, it is for everyone. In the timeout case we need to check if the
first thread in line can make progress. If we don't provide data new
data to the thread and the thread consumed everything it had then the
thread won't make progress. If the function gets invoked with 0 new data
then we should return LZMA_OK at which point the upper layer (between XZ
binary and the library) will notice that no progress is made and return
an error to xz.
The above patch is not correct because if you do it as I suggeted then
it is possible that an error is returned because the thread is slow and
did not yet make progress.

> I was following the conversation about the soft and hard memory limiting.
> If a user wanted decoding to fail if it can't be done multithreaded and update
> the memory limit as needed, that can't be done right now. It's a minor issue
> that only matters for liblzma, but it would be nice to be able to update both
> limits after decoding has started. I don't consider this a bug, more like
> missing a nice to have feature. One easy solution is to add a new API
> function to liblzma to update the soft memory limit for the multithreaded
> decoder and do nothing / return an error on all other coders. I will add
> a patch for this if you guys think it is a good idea.

Maybe I missundertand you. But if you set memlimit_stop to the same
value as memlimit_threading then you have either multi threaded
decompression or none at all right?

> Jia Tan

Sebastian



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-03-11 Thread Jia Tan
> This:
>
> diff --git a/src/liblzma/common/stream_decoder_mt.c 
> b/src/liblzma/common/stream_decoder_mt.c
> --- a/src/liblzma/common/stream_decoder_mt.c
> +++ b/src/liblzma/common/stream_decoder_mt.c
> @@ -786,7 +786,7 @@ read_output_and_wait(struct lzma_stream_coder *coder,
> if (mythread_cond_timedwait(>cond,
> >mutex,
> wait_abs) != 0) {
> -   ret = LZMA_TIMED_OUT;
> +   ret = LZMA_OK;
> break;
> }
> } else {
>
> Should "fixes" it. At some point the main thread needs to check if the
> next thread is able to make progress or not and then return LZMA_OK so
> that the upper layer can figure out that no progress is made. Otherwise
> it stucks in the LZMA_TIMED_OUT loop.

This fixes it just for xz, but if no timeout is specified it will
still deadlock.
I haven't looked at the code enough to understand how to fix it for both,
but I will start to look into that.

> > The issue is with updating the memlimit with lzma_memlimit_set. As you
> > note in your comment in stream_decoder_mt_memconfig there is no way to
> > update memlimit_threading. If the memlimit_stop is set very low to
> > start, it will force memlimit_threading to be that same small value. I
> > could see users wanting to keep memlimit_threading and memlimit_stop
> > in sync or have memlimit_threading always be some function of
> > memlimit_stop (maybe memlimit_stop / 2 or something). I am not sure
> > what the best fix is for this at the moment, but I don't think having
> > only one of these values be configurable at runtime is the best idea.
> > Especially when the initialization forces memlimit_threading to be at
> > most memlimit_stop (which makes sense for almost every situation).
>
> The idea to have on limit to keep things going (no matter what) and the
> other to have reasonable limit at which point you don't want threads to
> be used.

I was following the conversation about the soft and hard memory limiting.
If a user wanted decoding to fail if it can't be done multithreaded and update
the memory limit as needed, that can't be done right now. It's a minor issue
that only matters for liblzma, but it would be nice to be able to update both
limits after decoding has started. I don't consider this a bug, more like
missing a nice to have feature. One easy solution is to add a new API
function to liblzma to update the soft memory limit for the multithreaded
decoder and do nothing / return an error on all other coders. I will add
a patch for this if you guys think it is a good idea.

Jia Tan



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-03-10 Thread Sebastian Andrzej Siewior
On 2022-03-10 22:16:27 [+0800], Jia Tan wrote:
> I started writing tests in the new framework and I found one bug and
> one issue. If you want to check out the tests I have so far, here is a
> link to check out my progress:
> https://github.com/JiaT75/XZ_Utils_Unofficial/tree/test_multithreaded_decoder
> 
> The bug is with truncated xz files. In multithreaded mode, if a file
> has been corrupted and is missing the end, deadlock occurs. An easy
> way to recreate this is by using the truncate command:
> truncate -s 3 some_multiblock_file.xz
> 
> And then:
> xz -dk --verbose some_multiblock_file.xz --threads=2
> 
> This will result in a deadlock in multithreaded decoding, but not an
> error in single threaded decoding.

This:

diff --git a/src/liblzma/common/stream_decoder_mt.c 
b/src/liblzma/common/stream_decoder_mt.c
--- a/src/liblzma/common/stream_decoder_mt.c
+++ b/src/liblzma/common/stream_decoder_mt.c
@@ -786,7 +786,7 @@ read_output_and_wait(struct lzma_stream_coder *coder,
if (mythread_cond_timedwait(>cond,
>mutex,
wait_abs) != 0) {
-   ret = LZMA_TIMED_OUT;
+   ret = LZMA_OK;
break;
}
} else {

Should "fixes" it. At some point the main thread needs to check if the
next thread is able to make progress or not and then return LZMA_OK so
that the upper layer can figure out that no progress is made. Otherwise
it stucks in the LZMA_TIMED_OUT loop.

> The issue is with updating the memlimit with lzma_memlimit_set. As you
> note in your comment in stream_decoder_mt_memconfig there is no way to
> update memlimit_threading. If the memlimit_stop is set very low to
> start, it will force memlimit_threading to be that same small value. I
> could see users wanting to keep memlimit_threading and memlimit_stop
> in sync or have memlimit_threading always be some function of
> memlimit_stop (maybe memlimit_stop / 2 or something). I am not sure
> what the best fix is for this at the moment, but I don't think having
> only one of these values be configurable at runtime is the best idea.
> Especially when the initialization forces memlimit_threading to be at
> most memlimit_stop (which makes sense for almost every situation).

The idea to have on limit to keep things going (no matter what) and the
other to have reasonable limit at which point you don't want threads to
be used.

> I will continue to write more tests and then review the code itself.
> Nice job to both of you for getting this feature as far as it is
> already.
> 
> Jia Tan

Sebastian



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-03-10 Thread Jia Tan
> > Testing is welcome. It would be nice if someone who has 12-24 hardware
> > threads could test if it scales well. One needs a file with like a
> > hundred blocks, so with the default xz -6 that means a 2.5 gigabyte
> > uncompressed file, smaller if one uses, for example, --block-size=8MiB
> > when compressing.
>
> I made
> StreamBlocks  CompOffsetUncompOffsetCompSize  
> UncompSize  Ratio  Check  Padding
>  1   777   0   0   2.386.777.028  
> 19.540.326.400  0,122  CRC640

Thanks for doing this testing! Looks like the threading scales fairly
well on large files like this. I don't have a machine with resources
like this, so I am glad you were able to do this.

> > If the input is broken, it should produce as much output as the
> > single-threaded stable version does. That is, if one thread detects an
> > error, the data before that point is first flushed out before the error
> > is reported. This has pros and cons. It would be easy to add a flag to
> > allow switching to fast error reporting for applications that don't
> > care about partial output from broken files.

I started writing tests in the new framework and I found one bug and
one issue. If you want to check out the tests I have so far, here is a
link to check out my progress:
https://github.com/JiaT75/XZ_Utils_Unofficial/tree/test_multithreaded_decoder

The bug is with truncated xz files. In multithreaded mode, if a file
has been corrupted and is missing the end, deadlock occurs. An easy
way to recreate this is by using the truncate command:
truncate -s 3 some_multiblock_file.xz

And then:
xz -dk --verbose some_multiblock_file.xz --threads=2

This will result in a deadlock in multithreaded decoding, but not an
error in single threaded decoding.

The issue is with updating the memlimit with lzma_memlimit_set. As you
note in your comment in stream_decoder_mt_memconfig there is no way to
update memlimit_threading. If the memlimit_stop is set very low to
start, it will force memlimit_threading to be that same small value. I
could see users wanting to keep memlimit_threading and memlimit_stop
in sync or have memlimit_threading always be some function of
memlimit_stop (maybe memlimit_stop / 2 or something). I am not sure
what the best fix is for this at the moment, but I don't think having
only one of these values be configurable at runtime is the best idea.
Especially when the initialization forces memlimit_threading to be at
most memlimit_stop (which makes sense for almost every situation).

I will continue to write more tests and then review the code itself.
Nice job to both of you for getting this feature as far as it is
already.

Jia Tan



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-03-07 Thread Sebastian Andrzej Siewior
On 2022-03-07 01:08:40 [+0200], Lasse Collin wrote:
> Hello!
Hi,

> I committed something. The liblzma part shouldn't need any big changes,
> I hope. There are a few FIXMEs but some of them might actually be fine
> as is. The xz side is just an initial commit, there isn't even
> --memlimit-threading option yet (I will add it).
> 
> Testing is welcome. It would be nice if someone who has 12-24 hardware
> threads could test if it scales well. One needs a file with like a
> hundred blocks, so with the default xz -6 that means a 2.5 gigabyte
> uncompressed file, smaller if one uses, for example, --block-size=8MiB
> when compressing.

I made
StreamBlocks  CompOffsetUncompOffsetCompSize  
UncompSize  Ratio  Check  Padding
 1   777   0   0   2.386.777.028  
19.540.326.400  0,122  CRC640

one block is 25.165.824.

32 cores:

| $ time ./src/xz/xz -tv tars.tar.xz -T0
| tars.tar.xz (1/1)
|   100 %  2.276,2 MiB / 18,2 GiB = 0,122   1,6 GiB/s   0:11
 
| 
| real0m11,162s
| user5m44,108s
| sys 0m1,988s

256 cores:
| $ time ./src/xz/xz -tv tars.tar.xz -T0
| tars.tar.xz (1/1)
|   100 %  2.276,2 MiB / 18,2 GiB = 0,122   3,4 GiB/s   0:05
 
| 
| real0m5,403s
| user4m0,298s
| sys 0m24,315s

it appears to work :) If I see this right, then the file is too small or
xz too fast but it does not appear that xz manages to create more than
100 threads.

and decompression to disk
| $ time ~bigeasy/xz/src/xz/xz -dvk tars.tar.xz -T0
| tars.tar.xz (1/1)
|   100 %  2.276,2 MiB / 18,2 GiB = 0,122   746 MiB/s   0:24
 
| 
| real0m25,064s
| user3m49,175s
| sys 0m29,748s

appears to block at around 10 to 14 threads or so and then it hangs at the end
until disk I/O finishes. Decent.
Assuming disk I/O is slow, say 10MiB/s, and we would 388 CPUs (blocks/2)
then it would decompress the whole file into memory and stuck on disk
I/O?

In terms of scaling, xz -tv of that same file with with -T1…64:

| CPUS: 1
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 85 MiB/s, 3:38
| 
| real  3m38,047s
| user  3m37,404s
| sys   0m0,626s
| CPUS: 2
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 171 MiB/s, 1:49
| 
| real  1m49,296s
| user  3m41,529s
| sys   0m1,433s
| CPUS: 3
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 256 MiB/s, 1:12
| 
| real  1m12,832s
| user  3m40,929s
| sys   0m1,199s
| CPUS: 4
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 341 MiB/s, 0:54
| 
| real  0m54,616s
| user  3m40,596s
| sys   0m1,161s
| CPUS: 5
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 425 MiB/s, 0:43
| 
| real  0m43,900s
| user  3m41,306s
| sys   0m1,038s
| CPUS: 6
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 510 MiB/s, 0:36
| 
| real  0m36,587s
| user  3m41,527s
| sys   0m1,076s
| CPUS: 7
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 591 MiB/s, 0:31
| 
| real  0m31,568s
| user  3m41,559s
| sys   0m1,079s
| CPUS: 8
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 676 MiB/s, 0:27
| 
| real  0m27,579s
| user  3m42,098s
| sys   0m0,966s
| CPUS: 9
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 758 MiB/s, 0:24
| 
| real  0m24,614s
| user  3m42,318s
| sys   0m1,119s
| CPUS: 10
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 844 MiB/s, 0:22
| 
| real  0m22,111s
| user  3m41,353s
| sys   0m1,152s
| CPUS: 11
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 923 MiB/s, 0:20
| 
| real  0m20,219s
| user  3m43,327s
| sys   0m1,311s
| CPUS: 12
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,0 GiB/s, 0:18
| 
| real  0m18,442s
| user  3m41,710s
| sys   0m1,110s
| CPUS: 13
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,1 GiB/s, 0:17
| 
| real  0m17,067s
| user  3m42,102s
| sys   0m1,176s
| CPUS: 14
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,1 GiB/s, 0:15
| 
| real  0m15,861s
| user  3m41,978s
| sys   0m1,171s
| CPUS: 15
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,2 GiB/s, 0:14
| 
| real  0m14,866s
| user  3m42,247s
| sys   0m1,108s
| CPUS: 16
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,3 GiB/s, 0:13
| 
| real  0m13,936s
| user  3m41,086s
| sys   0m1,017s
| CPUS: 17
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,4 GiB/s, 0:13
| 
| real  0m13,200s
| user  3m42,171s
| sys   0m1,137s
| CPUS: 18
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,5 GiB/s, 0:12
| 
| real  0m12,539s
| user  3m43,286s
| sys   0m1,355s
| CPUS: 19
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,5 GiB/s, 0:11
| 
| real  0m11,949s
| user  3m44,354s
| sys   0m1,111s
| CPUS: 20
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,6 GiB/s, 0:11
| 
| real  0m11,216s
| user  3m42,635s
| sys   0m1,202s
| CPUS: 21
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,7 GiB/s, 0:10
| 
| real  0m10,655s
| user  3m41,742s
| sys   0m1,123s
| CPUS: 22
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,8 GiB/s, 0:10
| 
| real  0m10,232s
| user  3m42,328s
| sys   0m1,211s
| CPUS: 23
| tars.tar.xz: 2.276,2 MiB / 18,2 GiB = 0,122, 1,9 GiB/s, 0:09
| 
| real  

Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-03-06 Thread Lasse Collin
Hello!

I committed something. The liblzma part shouldn't need any big changes,
I hope. There are a few FIXMEs but some of them might actually be fine
as is. The xz side is just an initial commit, there isn't even
--memlimit-threading option yet (I will add it).

Testing is welcome. It would be nice if someone who has 12-24 hardware
threads could test if it scales well. One needs a file with like a
hundred blocks, so with the default xz -6 that means a 2.5 gigabyte
uncompressed file, smaller if one uses, for example, --block-size=8MiB
when compressing.

If the input is broken, it should produce as much output as the
single-threaded stable version does. That is, if one thread detects an
error, the data before that point is first flushed out before the error
is reported. This has pros and cons. It would be easy to add a flag to
allow switching to fast error reporting for applications that don't
care about partial output from broken files.

-- 
Lasse Collin



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-02-07 Thread Sebastian Andrzej Siewior
On 2022-02-07 01:46:32 [+0200], Lasse Collin wrote:
> I now plan to use memlimit_threading and memlimit_stop in the lzma_mt
> structure. Documentation is still needed but hopefully those are a bit
> more obvious.

oki

…

[ long update ]

> I added support for lzma_get_progress(). It is needed to make xz -v
> progress indicator useful.

Thank you for the update.

Sebastian



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2022-02-06 Thread Lasse Collin
On 2021-12-31 Sebastian Andrzej Siewior wrote:
> On 2021-12-15 23:33:58 [+0200], Lasse Collin wrote:
> > Yes. It's fairly simple from implementation point of view but is it
> > clear enough for the users, I'm not sure.
> > 
> > I suppose the alternative is having just one limit value and a flag
> > to tell if it is a soft limit (so no limit for single-threaded
> > case) or a hard limit (return LZMA_MEM_ERROR if too low for even
> > single thread). Having separate soft and hard limits instead can
> > achieve the same and a little more, so I think I'll choose the
> > two-value approach and hope it's clear enough for users.  
> 
> The value approach might work. I'm not sure if the term `soft' and
> `hard' are good here. Using `memlimit' and `memlimit_threaded' (or so)
> might make more obvious and easier to understand.
> But then this just some documentation that needs to be read and
> understood so maybe `softlimit' and `hardlimit' will work just fine.

I now plan to use memlimit_threading and memlimit_stop in the lzma_mt
structure. Documentation is still needed but hopefully those are a bit
more obvious.

> > I was hoping to get this finished by Christmas but due to a recent
> > sad event, late January is my target for the next alpha release
> > now.

And I'm late again. :-(

This is more work than I had expected because there unfortunately are a
few problems in the code and fixing them all requires quite significant
changes (and I'm slow). As a bonus, working on this made me notice a few
small bugs in the old liblzma code too (not yet committed).

The following tries to explain some of the problems and what I have
done locally. I don't have code to show yet because it still contains
too many small FIXMEs but, as unbelievable as it might sound, this will
get done. I need a few more days; I have other things I must do too.


The biggest issue is handling of memory usage and threaded vs. direct
mode. The memory usage limiting code makes assumptions that are true
with the most common files but there are situations where these
assumptions fail:

(1) If a non-first Block requires a lot more memory than the first
Block and so the memory limit would be exceeded in threaded mode,
the decoder will not switch to direct mode even with
LZMA_MEMLIMIT_COMPLETE. Instead the decoder proceeds with one
thread and uses as much memory as that needs.

(2) If a non-first Block lacks size info in its Block Header, the
decoder won't switch to direct mode. It returns LZMA_PROG_ERROR
instead.

(3) The per-thread input buffers can grow as bigger Blocks are seen but
the buffers cannot shrink. This has pros and cons. It's a problem if
a single Block is very big and others are not.

I thought it's better to first decode the Block Header to
coder->block_options and then, based on the facts from that Block
Header, determine memory usage and how to proceed (including switching
to/from direct mode). This way there is no need to assume or expect
anything. (coder->block_options need to be copied to a thread-specific
structure before initializing the decoder.)

For direct mode, I added separate SEQ states for it. This also helps
making the code more similar to the single-threaded decoder in both
looks and behavior. I hope that with memlimit_threading = 0 the
threaded version can have identical externally-visible behavior as the
original single-threaded version. This way xz doesn't need both
functions (the single-threaded function is still needed if built with
--disable-threads).


Corner cases of the buffer-to-buffer API:

(4) In some use cases there might be long pauses where no new input is
available (for example, sending a live log file over network with
compression). It is essential that the decoder will still provide
all output that is easily possible from the input so far. That is,
if the decoder was called without providing any new input, it might
need to be handled specially.

SEQ_BLOCK_HEADER and SEQ_INDEX return immediately if the application
isn't providing any new input data, and so eventually lzma_code()
will return LZMA_BUF_ERROR even when there would be output
available from the worker threads. try_copy_decoded() could be
called earlier but there is more to fix (see (5) and (6)).

(Also remember my comment above that I changed the code so that
Block Header is decoded first before getting a thread. That adds
one more SEQ point where waiting for output is needed.)

(5) The decoder must work when the application provides an output
buffer whose size is exactly the uncompressed size of the file.
This means that one cannot simply use *out_pos == out_size to
determine when to return LZMA_OK. Perhaps the decoder hasn't marked
its lzma_outbuf as finished but no more output will be coming, or
there is an empty Block (empty Blocks perhaps shouldn't have been
allowed in the format at all but it's too late for that).

In short, 

Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2021-12-31 Thread Sebastian Andrzej Siewior
On 2021-12-15 23:33:58 [+0200], Lasse Collin wrote:
> Yes. It's fairly simple from implementation point of view but is it
> clear enough for the users, I'm not sure.
> 
> I suppose the alternative is having just one limit value and a flag to
> tell if it is a soft limit (so no limit for single-threaded case) or a
> hard limit (return LZMA_MEM_ERROR if too low for even single thread).
> Having separate soft and hard limits instead can achieve the same and a
> little more, so I think I'll choose the two-value approach and hope it's
> clear enough for users.

The value approach might work. I'm not sure if the term `soft' and
`hard' are good here. Using `memlimit' and `memlimit_threaded' (or so)
might make more obvious and easier to understand.
But then this just some documentation that needs to be read and
understood so maybe `softlimit' and `hardlimit' will work just fine.

> I was hoping to get this finished by Christmas but due to a recent sad
> event, late January is my target for the next alpha release now. I hope
> to include a few other things too, including some of Jia Tan's patches
> (we've chatted outside the xz-devel list). Thank you for understanding.

Okay.

Sebastian



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2021-12-15 Thread Lasse Collin
On 2021-12-04 Sebastian Andrzej Siewior wrote:
> On 2021-11-30 00:25:11 [+0200], Lasse Collin wrote:
> > Separate soft and hard limits might be convenient from
> > implementation point of view though. xz would need --memlimit-soft
> > (or some better name) which would always have some default value
> > (like MemAvailable). The threaded decoder in liblzma would need to
> > take two memlimit values. Then there would be no need for an enum
> > (or a flag) to specify the memlimit mode (assuming that
> > LZMA_MEMLIMIT_THREAD is removed).  
> 
> Ah I see. So one would say soft-limit 80MiB, hard-limit 2^60bytes and
> would get no threading at all / LZMA_MEMLIMIT_NO_THREAD. And with soft
> 1GiB, hard 2^60bytes would get the threading mode. (2^60 is made up
> no limit).

Yes. It's fairly simple from implementation point of view but is it
clear enough for the users, I'm not sure.

I suppose the alternative is having just one limit value and a flag to
tell if it is a soft limit (so no limit for single-threaded case) or a
hard limit (return LZMA_MEM_ERROR if too low for even single thread).
Having separate soft and hard limits instead can achieve the same and a
little more, so I think I'll choose the two-value approach and hope it's
clear enough for users.

> > I wonder if relying on the lzma_mt struct is useful for the decoder.
> > Perhaps the options could be passed directly as arguments as there
> > are still 2-3 fewer than needed for the encoder.  
> 
> Thre is
> - num threads
> - flags
> - memlimit
> - timeout
> 
> One struct to rule them all and you could extend it without the need
> to change the ABI.
> I took one of the reserved ones for the memlimit. If you put the two
> memory limits and number of threads in one init/configure function
> then only flags and timeout is left. Maybe that would be enought then.

You have a valid point. Either approach works, new functions can be
added if needed for extending the ABI, but having just one can be nice
in the long term.

I was hoping to get this finished by Christmas but due to a recent sad
event, late January is my target for the next alpha release now. I hope
to include a few other things too, including some of Jia Tan's patches
(we've chatted outside the xz-devel list). Thank you for understanding.

-- 
Lasse Collin



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2021-12-04 Thread Sebastian Andrzej Siewior
On 2021-11-30 00:25:11 [+0200], Lasse Collin wrote:
> Hello!
Hello,

> On 2021-02-05 Sebastian Andrzej Siewior wrote:
> > - Added enum `lzma_memlimit_opt' to lzma_stream_decoder_mt() as an
> >   init parameter. The idea is to specify how to obey the memory limit
> >   so the user can keep using one API and not worry to fail due to the
> >   memory limit. Lets assume the archive has a 9MiB dictionary, 24MiB
> >   block of uncompressed data. The archive contains two compressed
> >   blocks of 10 MiB each. Using two threads, the memory requirement is
> >   roughly (9 + 24 + 10) * 2 = 86 MiB
> > 
> >   On a system with 64 MiB of memory with additional 128MiB of swap it
> >   likely leads to the use of (say 30 MiB) swap memory during
> >   decompression which will slow down the whole operation.
> >   The synchronous API would do just fine with only 9 MiB of memory.
> > 
> >   So to not complicate things, invoking lzma_stream_decoder_mt() with
> >   a memory limit of 32 MiB three scenarios are possible:
> >   - LZMA_MEMLIMIT_THREAD
> > One thread requires 43MiB of memory and would exceed the memory
> > limit. However, continue with one thread instead of possible two.
> > 
> >   - LZMA_MEMLIMIT_NO_THREAD
> > One thread requires 43MiB of memory and would exceed the memory
> > limit. Fallback to the synchronous API without buffered input /
> > output memory.
> > 
> >   - LZMA_MEMLIMIT_COMPLETE
> > In this scenario it would behave like LZMA_MEMLIMIT_NO_THREAD.
> > However, with a dictionary size > 32MiB it would abort.
> 
> In the old single-threaded code, if no memory usage limit is specified
> the worst case memory usage with LZMA2 is about 4 GiB (the format
> allows 4 GiB dict although the current encoder only supports 1536 MiB).
> With the threaded decoder it's the same with LZMA_MEMLIMIT_NO_THREAD.
> 
> However, LZMA_MEMLIMIT_THREAD sounds a bit scary. There are no practical
> limits to the block size so there can be a .xz file that makes the
> decoder allocate a huge amount of memory. It doesn't even need to be an
> intentionally malicious file, it just needs to have the size fields
> present. Thus, I think LZMA_MEMLIMIT_THREAD should be removed.
> One-thread multi-threaded mode will still be used with
> LZMA_MEMLIMIT_NO_THREAD if the limit is high enough.

You are right. I can't think of a reason to force one thread.

> LZMA_MEMLIMIT_NO_THREAD should be the default in xz when no memory
> usage limit has been explicitly specified. There needs to be a default
> "soft limit" (the MemAvailable method is such) that will drop xz to
> single-threaded mode if the soft limit is too high for threaded mode
> (even with just one thread).
> 
> LZMA_MEMLIMIT_COMPLETE could be the mode to use when a memlimit is
> explicitly specified (a "hard limit") on the xz command line. This would
> match the existing behavior of the old single-threaded decoder. It
> would be good to have a way to specify a soft limit on the xz command
> line too.
> 
> It could make sense to have both soft and hard limit at the same time
> but perhaps it gets too confusing: Soft limit that would be used to
> restrict the number of threads (and even drop to single-threaded mode)
> and hard limit which can return LZMA_MEMLIMIT_ERROR. If one is fine to
> use 300 MiB in threaded mode but still wants to allow up to 600 MiB in
> case the file *really* requires that much even in single-threaded mode,
> then this would be useful.

I see. This might sounds like advanced usage.
Looking from the package-manager point of view (installs packages and
needs it decompressed), the work flow is "I have here 80MiB of free
memory but I need to have this decompressed so do what needs to be done
in order fulfil your job." So if the dictionary is 128MiB in size, do it
without threading and hope for swap space but this *used* to work before
that, too.

> Separate soft and hard limits might be convenient from implementation
> point of view though. xz would need --memlimit-soft (or some better
> name) which would always have some default value (like MemAvailable).
> The threaded decoder in liblzma would need to take two memlimit values.
> Then there would be no need for an enum (or a flag) to specify the
> memlimit mode (assuming that LZMA_MEMLIMIT_THREAD is removed).

Ah I see. So one would say soft-limit 80MiB, hard-limit 2^60bytes and
would get no threading at all / LZMA_MEMLIMIT_NO_THREAD. And with soft
1GiB, hard 2^60bytes would get the threading mode. (2^60 is made up
no limit).

> Extra idea, maybe useless: The --no-adjust option could be used to
> specify that if the specified number of threads isn't possible due to
> a memlimit then xz will abort. This is slightly weird as it doesn't
> provide real performance guarantees anyway (block sizes could vary a
> lot) but it's easy to implement if it is wanted.
> 
> I wonder if relying on the lzma_mt struct is useful for the decoder.
> Perhaps the options could be passed directly as arguments as 

Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2021-11-29 Thread Lasse Collin
Hello!

On 2021-02-05 Sebastian Andrzej Siewior wrote:
> - Added enum `lzma_memlimit_opt' to lzma_stream_decoder_mt() as an
>   init parameter. The idea is to specify how to obey the memory limit
>   so the user can keep using one API and not worry to fail due to the
>   memory limit. Lets assume the archive has a 9MiB dictionary, 24MiB
>   block of uncompressed data. The archive contains two compressed
>   blocks of 10 MiB each. Using two threads, the memory requirement is
>   roughly (9 + 24 + 10) * 2 = 86 MiB
> 
>   On a system with 64 MiB of memory with additional 128MiB of swap it
>   likely leads to the use of (say 30 MiB) swap memory during
>   decompression which will slow down the whole operation.
>   The synchronous API would do just fine with only 9 MiB of memory.
> 
>   So to not complicate things, invoking lzma_stream_decoder_mt() with
>   a memory limit of 32 MiB three scenarios are possible:
>   - LZMA_MEMLIMIT_THREAD
> One thread requires 43MiB of memory and would exceed the memory
> limit. However, continue with one thread instead of possible two.
> 
>   - LZMA_MEMLIMIT_NO_THREAD
> One thread requires 43MiB of memory and would exceed the memory
> limit. Fallback to the synchronous API without buffered input /
> output memory.
> 
>   - LZMA_MEMLIMIT_COMPLETE
> In this scenario it would behave like LZMA_MEMLIMIT_NO_THREAD.
> However, with a dictionary size > 32MiB it would abort.

In the old single-threaded code, if no memory usage limit is specified
the worst case memory usage with LZMA2 is about 4 GiB (the format
allows 4 GiB dict although the current encoder only supports 1536 MiB).
With the threaded decoder it's the same with LZMA_MEMLIMIT_NO_THREAD.

However, LZMA_MEMLIMIT_THREAD sounds a bit scary. There are no practical
limits to the block size so there can be a .xz file that makes the
decoder allocate a huge amount of memory. It doesn't even need to be an
intentionally malicious file, it just needs to have the size fields
present. Thus, I think LZMA_MEMLIMIT_THREAD should be removed.
One-thread multi-threaded mode will still be used with
LZMA_MEMLIMIT_NO_THREAD if the limit is high enough.

LZMA_MEMLIMIT_NO_THREAD should be the default in xz when no memory
usage limit has been explicitly specified. There needs to be a default
"soft limit" (the MemAvailable method is such) that will drop xz to
single-threaded mode if the soft limit is too high for threaded mode
(even with just one thread).

LZMA_MEMLIMIT_COMPLETE could be the mode to use when a memlimit is
explicitly specified (a "hard limit") on the xz command line. This would
match the existing behavior of the old single-threaded decoder. It
would be good to have a way to specify a soft limit on the xz command
line too.

It could make sense to have both soft and hard limit at the same time
but perhaps it gets too confusing: Soft limit that would be used to
restrict the number of threads (and even drop to single-threaded mode)
and hard limit which can return LZMA_MEMLIMIT_ERROR. If one is fine to
use 300 MiB in threaded mode but still wants to allow up to 600 MiB in
case the file *really* requires that much even in single-threaded mode,
then this would be useful.

Separate soft and hard limits might be convenient from implementation
point of view though. xz would need --memlimit-soft (or some better
name) which would always have some default value (like MemAvailable).
The threaded decoder in liblzma would need to take two memlimit values.
Then there would be no need for an enum (or a flag) to specify the
memlimit mode (assuming that LZMA_MEMLIMIT_THREAD is removed).

Extra idea, maybe useless: The --no-adjust option could be used to
specify that if the specified number of threads isn't possible due to
a memlimit then xz will abort. This is slightly weird as it doesn't
provide real performance guarantees anyway (block sizes could vary a
lot) but it's easy to implement if it is wanted.

I wonder if relying on the lzma_mt struct is useful for the decoder.
Perhaps the options could be passed directly as arguments as there are
still 2-3 fewer than needed for the encoder.

I've made some other minor edits locally already so I would prefer to
*not* get new patch revisions until I have committed something.
Comments are very welcome. :-)

Thanks!

-- 
Lasse Collin



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2021-08-10 Thread Sebastian Andrzej Siewior
On 2021-07-20 02:09:31 [+0200], Guillem Jover wrote:
> Hi!
Hi Guillem,

> On Fri, 2021-02-05 at 00:11:57 +0100, Sebastian Andrzej Siewior wrote:
> > From: Sebastian Andrzej Siewior 
> 
> I've only skimmer very quickly over the patch, but I've been running
> it on my system in addition to a locally modified dpkg that uses this
> new support, and it seems to be working great. :)
> 
>   
> 

Thanks for letting me know. You used LZMA_MEMLIMIT_NO_THREAD. How
lovely. I added that piece exactly to satisfy dpkg's expectations hoping
that a buildd with 64MiB RAM will behave as it did with the single
threaded implementation :)

> Thanks,
> Guillem

Sebastian



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2021-07-20 Thread Lasse Collin
Hello!

On 2021-07-20 Guillem Jover wrote:
> I've only skimmer very quickly over the patch, but I've been running
> it on my system in addition to a locally modified dpkg that uses this
> new support, and it seems to be working great. :)

Great to hear, thanks! :-) Unfortunately I don't have any news. :-(

-- 
Lasse Collin



Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

2021-07-19 Thread Guillem Jover
Hi!

On Fri, 2021-02-05 at 00:11:57 +0100, Sebastian Andrzej Siewior wrote:
> From: Sebastian Andrzej Siewior 

I've only skimmer very quickly over the patch, but I've been running
it on my system in addition to a locally modified dpkg that uses this
new support, and it seems to be working great. :)

  


Thanks,
Guillem