Once again, sorry for the delay. I will be busy the rest of the week. I
will get back to xz early next week.

On 2022-03-07 Sebastian Andrzej Siewior wrote:
> 32 cores:
> | $ time ./src/xz/xz -tv tars.tar.xz -T0
> | tars.tar.xz (1/1)
> |   100 %      2.276,2 MiB / 18,2 GiB = 0,122   1,6 GiB/s       0:11
> | 
> | real    0m11,162s
> | user    5m44,108s
> | sys     0m1,988s
> 256 cores:
> | $ time ./src/xz/xz -tv tars.tar.xz -T0
> | tars.tar.xz (1/1)
> |   100 %      2.276,2 MiB / 18,2 GiB = 0,122   3,4 GiB/s       0:05
> | 
> | real    0m5,403s
> | user    4m0,298s
> | sys     0m24,315s
> it appears to work :) If I see this right, then the file is too small
> or xz too fast but it does not appear that xz manages to create more
> than 100 threads.

Thanks! The scaling is definitely good enough. :-) Even if there is
room for improvement, I won't think about it much for now.

A curious thing above is the user-to-sys time ratio: with more
threads, far more time is spent in syscalls.

> and decompression to disk
> | $ time ~bigeasy/xz/src/xz/xz -dvk tars.tar.xz -T0
> | tars.tar.xz (1/1)
> |   100 %      2.276,2 MiB / 18,2 GiB = 0,122   746 MiB/s       0:24
> | 
> | real    0m25,064s
> | user    3m49,175s
> | sys     0m29,748s
> appears to block at around 10 to 14 threads or so and then it hangs
> at the end until disk I/O finishes. Decent.
> Assuming disk I/O is slow, say 10 MiB/s, and we had 388 CPUs
> (blocks/2), would it then decompress the whole file into memory and
> get stuck on disk I/O?

I wonder if the way xz does I/O might affect performance. Every time
the 8192-byte input buffer is empty (that is, liblzma has consumed
it), xz blocks reading more input until another 8192 bytes have been
read. As long as the threads can consume more input, each call to
lzma_code() will use the full 8192 bytes, and each call may also pass
up to 8192 bytes of output from liblzma to xz. If the compression
ratio is high and reading input isn't very fast, performance might
drop because blocking on input prevents xz from producing more
output. Only when liblzma cannot consume more input will xz produce
output at full speed.

That is, I wonder if, with slow input, the output speed will be
limited until the input buffers inside liblzma have been filled. My
explanation isn't very good, sorry.

Ideally input and output would be handled in different threads, but
the liblzma API doesn't really allow that. Based on your benchmarks,
the current method is likely good enough in practice.

> In terms of scaling, xz -tv of that same file with -T1…64:
> time of 1 CPU / 64 = (3 * 60 + 38) / 64 = 3.40625
> Looks okay.

Yes, thanks!

> > If the input is broken, it should produce as much output as the
> > single-threaded stable version does. That is, if one thread detects
> > an error, the data before that point is first flushed out before
> > the error is reported. This has pros and cons. It would be easy to
> > add a flag to allow switching to fast error reporting for
> > applications that don't care about partial output from broken
> > files.  
> I guess most of them don't care because an error is usually an abort,
> the sooner, the better. It is probably the exception that you want
> decompress it despite the error and maybe go on with the next block
> and see what is left.

I agree. Over 99 % of the time any error means that the whole output
will be discarded. However, I would like the threaded decoder to
(optionally) behave very similarly to the single-threaded version in
cases where it might matter. It's not perfect at the moment but I
think it's decent enough (bugs excluded).

Truncated files are a special case of corrupt input because, unless
LZMA_FINISH is used, liblzma cannot know if the input is truncated or
if there is merely a pause in the input for some application-specific
reason. That can result in LZMA_BUF_ERROR but if the application knows
that such pauses are possible then it can handle LZMA_BUF_ERROR
specially and continue decoding when more input is available.

Lasse Collin
