Once again, sorry for the delay. I will be busy the rest of the week. I
will get back to xz early next week.

On 2022-03-07 Sebastian Andrzej Siewior wrote:
> 32 cores:
> | $ time ./src/xz/xz -tv tars.tar.xz -T0
> | tars.tar.xz (1/1)
> |   100 %      2.276,2 MiB / 18,2 GiB = 0,122   1,6 GiB/s       0:11
> | 
> | real    0m11,162s
> | user    5m44,108s
> | sys     0m1,988s
> 256 cores:
> | $ time ./src/xz/xz -tv tars.tar.xz -T0
> | tars.tar.xz (1/1)
> |   100 %      2.276,2 MiB / 18,2 GiB = 0,122   3,4 GiB/s       0:05
> | 
> | real    0m5,403s
> | user    4m0,298s
> | sys     0m24,315s
> it appears to work :) If I see this right, then the file is too small
> or xz too fast but it does not appear that xz manages to create more
> than 100 threads.

Thanks! The scaling is definitely good enough. :-) Even if there is
room for improvement, I won't think about it much for now.

A curious thing above is the user-to-sys time ratio: with more
threads, far more time is spent in syscalls.

> and decompression to disk
> | $ time ~bigeasy/xz/src/xz/xz -dvk tars.tar.xz -T0
> | tars.tar.xz (1/1)
> |   100 %      2.276,2 MiB / 18,2 GiB = 0,122   746 MiB/s       0:24
> | 
> | real    0m25,064s
> | user    3m49,175s
> | sys     0m29,748s
> appears to block at around 10 to 14 threads or so and then it hangs
> at the end until disk I/O finishes. Decent.
> Assuming disk I/O is slow, say 10 MiB/s, and we had 388 CPUs
> (blocks/2), would it then decompress the whole file into memory and
> get stuck on disk I/O?

I wonder if the way xz does I/O might affect performance. Every time
the 8192-byte input buffer is empty (that is, liblzma has consumed
it), xz blocks reading more input until another 8192 bytes have been
read. As long as the threads can consume more input, each call to
lzma_code() will use the full 8192 bytes, and each call may also pass
up to 8192 bytes of output from liblzma to xz. If the compression
ratio is high and reading input isn't very fast, performance might
drop because blocking on input prevents xz from producing more
output. Only when liblzma cannot consume more input will xz produce
output at full speed.

That is, I wonder if, with slow input, the output speed will be
limited until the input buffers inside liblzma have been filled. My
explanation isn't very good, sorry.

Ideally input and output would be handled in different threads, but
the liblzma API doesn't really allow that. Based on your benchmarks,
the current method is likely good enough in practice.

> In terms of scaling, xz -tv of that same file with -T1…64:
> time of 1 CPU / 64 = (3 * 60 + 38) / 64 = 3.40625
> Looks okay.

Yes, thanks!

> > If the input is broken, it should produce as much output as the
> > single-threaded stable version does. That is, if one thread detects
> > an error, the data before that point is first flushed out before
> > the error is reported. This has pros and cons. It would be easy to
> > add a flag to allow switching to fast error reporting for
> > applications that don't care about partial output from broken
> > files.  
> I guess most of them don't care because an error is usually an abort,
> the sooner, the better. It is probably the exception that you want
> decompress it despite the error and maybe go on with the next block
> and see what is left.

I agree. Over 99 % of the time any error means that the whole output
will be discarded. However, I would like the threaded decoder to
(optionally) behave very similarly to the single-threaded version in
cases where it might matter. It's not perfect at the moment but I
think it's decent enough (bugs excluded).

Truncated files are a special case of corrupt input because, unless
LZMA_FINISH is used, liblzma cannot know if the input is truncated or
if there is merely a pause in the input for some application-specific
reason. That can result in LZMA_BUF_ERROR but if the application knows
that such pauses are possible then it can handle LZMA_BUF_ERROR
specially and continue decoding when more input is available.

Lasse Collin
