Re: [xz-devel] [PATCH v2] liblzma: Add multi-threaded decoder

2021-04-11 Thread Sebastian Andrzej Siewior
On 2021-04-11 21:48:39 [+0300], Lasse Collin wrote:
> 
> Yes, sorry, I didn't read carefully enough.

no worries.

> After the XZ for Java 1.9 release I wanted a short break but it's
> already been a month. Outside the Java version there are multiple XZ
> things to do (not all on xz-devel) and I think I should do a few of the
> small ones first to shorten the list a little before your big threading
> patch. I know my handling of the threading patch may at this point
> appear both rude and ridiculous but I just don't get enough things
> done, I'm really sorry. :-(

Okay. So last time you threw a few .xz files at me which did improve the
decoder logic which was nice. If you have any of these in your cabinet…

Other than that, I keep waiting until you get to it. As I said earlier,
the Debian gates are closed so for now I don't mind waiting.

I don't know if this "query available memory" qualifies for little
thing. It was part the huge patch but it could be split out. There was
also the 3GiB limit for 32bit architectures I asked for compared to the
~4GiB limit which works only on 32bit xz running on 64bit host. The mips
patch that you merged is kind of in the similar corner (although I don't
remember if the fixed 2:2 split counts for all mips all just a specific
architecture).

Sebastian



Re: [xz-devel] [PATCH v2] liblzma: Add multi-threaded decoder

2021-04-11 Thread Lasse Collin
On 2021-01-28 Sebastian Andrzej Siewior wrote:
> On 2021-01-24 23:56:15 [+0200], Lasse Collin wrote:
> > This way only one worker will be frequently locking the main mutex.
> > However, I haven't tried it and thus don't know how much this
> > affects performance in practice. One possible problem might be that
> > it may introduce a small delay in output availability when the main
> > thread switches reading from the next outbuf in the list.  
> 
> I did implement it as described, didn't I?

Yes, sorry, I didn't read carefully enough.

After the XZ for Java 1.9 release I wanted a short break but it's
already been a month. Outside the Java version there are multiple XZ
things to do (not all on xz-devel) and I think I should do a few of the
small ones first to shorten the list a little before your big threading
patch. I know my handling of the threading patch may at this point
appear both rude and ridiculous but I just don't get enough things
done, I'm really sorry. :-(

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] [PATCH v2] liblzma: Add multi-threaded decoder

2021-01-27 Thread Sebastian Andrzej Siewior
On 2021-01-24 23:56:15 [+0200], Lasse Collin wrote:
> Hello!
Hi,

> I haven't made much progress with this still, I'm sorry. :-( Below are
> comments about a few small details. It's not much but I will (slowly)
> keep reading and testing.

Thank you.

> (1) Segfault due to thr->outbuf == NULL
> 
> I changed CHUNK_SIZE to 1 to test corner cases. I used
> good-1-block_header-1.xz as the test file. It can segfault in
> worker_decoder() on the line calling thr->block_decoder.code(...)
> because thr->outbuf is NULL (so the problem was introduced in the outq
> patch). This happens because of "thr->outbuf = NULL;" later in the
> function.
> 
> It looks like that it marks the outbuf finished and returns the thread
> to the pool too early or forgets to set thr->state = THR_IDLE. As a
> temporary workaround, I added "thr->state = THR_IDLE;" after
> "thr->outbuf = NULL;".

I moved the logic to reset the in buffer only after LZMA_STREAM_END.

> (2) Block decoder must return LZMA_STREAM_END on success
> 
> Because of end marker and integrity check, the output buffer will be
> full before the last bytes of input have been processed by the Block
> decoder. Thus it is not enough to look at the input and output
> positions to determine when decoding has been finished; only
> LZMA_STREAM_END should be used to determine that decoding was
> successful.
> 
> In theory it is OK to mark the outbuf as finished once the output is
> full but for simplicity I suggest doing so (and returning the thread to
> the pool) only after LZMA_STREAM_END.
> 
> I committed a new test file bad-1-check-crc32-2.xz. The last byte in
> the Block (last byte of Check) is wrong. Change CHUNK_SIZE to 1 and try
> "xz -t -T2 file bad-1-check-crc32-2.xz". The file must be detected to
> be corrupt (LZMA_DATA_ERROR).
> 

I changed that as suggested.

> (3) Bad input where the whole input or output buffer cannot be used
> 
> In the old single-threaded decoding, lzma_code() will eventually return
> LZMA_BUF_ERROR if the calls to lzma_code() cannot make any progress,
> that is, no more input is consumed and no more output is produced. This
> condition can happen with correct code if the input file is corrupt in
> a certain way, for example, a truncated .xz file.
> 
> Since the no-progress detection is centralized in lzma_code(), the
> internal decoders including Block decoder don't try to detect this
> situation. Currently this means that worker_decoder() should detect it
> to catch bad input and prevent hanging on certain malformed Blocks.
> However, since the Block decoder knows both Compressed Size and
> Uncompressed Size, I think I will improve Block decoder instead so
> don't do anything about this for now.
> 
> I committed two test files, bad-1-lzma2-9.xz and bad-1-lzma2-10.xz. The
> -9 may make worker_decoder() not notice that the Block is invalid. The
> -10 makes the decoder hang. Like I said, I might fix these by changing
> the Block decoder.

So I updated the logic that if LZMA_OK is returned from the block
decoder and either the complete in-buffer or out-buffer has been
consumed then it returns LZMA_DATA_ERROR.

> (4) Usage of partial_update in worker_decoder()
> 
> Terminology: main mutex means coder->mutex alias thr->coder->mutex.
> 
> In worker_decoder(), the main mutex is locked every time there is new
> output available in the worker thread. partial_update is only used to
> determine when to signal thr->coder->cond.
> 
> To reduce contention on the main mutex, worker_decoder() could lock it
> only when
>   - decoding of the Block has been finished (successfully or
> unsuccessfully, that is, ret != LZMA_OK), or
>   - there is new output available and partial_update is true; if
> partial_update is false, thr->outbuf->pos is not touched.
> 
> This way only one worker will be frequently locking the main mutex.
> However, I haven't tried it and thus don't know how much this affects
> performance in practice. One possible problem might be that it may
> introduce a small delay in output availability when the main thread
> switches reading from the next outbuf in the list.

I did implement it as described, didn't I?

> (5) Use of mythread_condtime_set()
> 
> In the encoder the absolute time is calculated once per lzma_code()
> call. The comment in wait_for_work() in in stream_encoder_mt.c was
> wrong. The reason the absolute time is calculated once per lzma_code()
> call is to ensure that blocking multiple times won't make the timeout
> ineffective if each blocking takes less than timeout milliseconds. So
> it should be done similarly in the decoder.

oh, okay. I assumed it as in "waiting is okay but no longer than timeout
secs".
It is possible that at SEQ_INDEX time we wait less than `timeout' copy a
few bytes and then wait again less than `timeout' until the out-buffer
is full. But hey, we make progress ;)
At SEQ_BLOCK_HEADER we would wait only once and return on the second
iteration because we made progress. But yes, depending on how much has
been don

Re: [xz-devel] [PATCH v2] liblzma: Add multi-threaded decoder

2021-01-24 Thread Lasse Collin
Hello!

I haven't made much progress with this still, I'm sorry. :-( Below are
comments about a few small details. It's not much but I will (slowly)
keep reading and testing.

I applied the outq patch too. The performance numbers you posted looked
promising.


(1) Segfault due to thr->outbuf == NULL

I changed CHUNK_SIZE to 1 to test corner cases. I used
good-1-block_header-1.xz as the test file. It can segfault in
worker_decoder() on the line calling thr->block_decoder.code(...)
because thr->outbuf is NULL (so the problem was introduced in the outq
patch). This happens because of "thr->outbuf = NULL;" later in the
function.

It looks like that it marks the outbuf finished and returns the thread
to the pool too early or forgets to set thr->state = THR_IDLE. As a
temporary workaround, I added "thr->state = THR_IDLE;" after
"thr->outbuf = NULL;".


(2) Block decoder must return LZMA_STREAM_END on success

Because of end marker and integrity check, the output buffer will be
full before the last bytes of input have been processed by the Block
decoder. Thus it is not enough to look at the input and output
positions to determine when decoding has been finished; only
LZMA_STREAM_END should be used to determine that decoding was
successful.

In theory it is OK to mark the outbuf as finished once the output is
full but for simplicity I suggest doing so (and returning the thread to
the pool) only after LZMA_STREAM_END.

I committed a new test file bad-1-check-crc32-2.xz. The last byte in
the Block (last byte of Check) is wrong. Change CHUNK_SIZE to 1 and try
"xz -t -T2 file bad-1-check-crc32-2.xz". The file must be detected to
be corrupt (LZMA_DATA_ERROR).


(3) Bad input where the whole input or output buffer cannot be used

In the old single-threaded decoding, lzma_code() will eventually return
LZMA_BUF_ERROR if the calls to lzma_code() cannot make any progress,
that is, no more input is consumed and no more output is produced. This
condition can happen with correct code if the input file is corrupt in
a certain way, for example, a truncated .xz file.

Since the no-progress detection is centralized in lzma_code(), the
internal decoders including Block decoder don't try to detect this
situation. Currently this means that worker_decoder() should detect it
to catch bad input and prevent hanging on certain malformed Blocks.
However, since the Block decoder knows both Compressed Size and
Uncompressed Size, I think I will improve Block decoder instead so
don't do anything about this for now.

I committed two test files, bad-1-lzma2-9.xz and bad-1-lzma2-10.xz. The
-9 may make worker_decoder() not notice that the Block is invalid. The
-10 makes the decoder hang. Like I said, I might fix these by changing
the Block decoder.


(4) Usage of partial_update in worker_decoder()

Terminology: main mutex means coder->mutex alias thr->coder->mutex.

In worker_decoder(), the main mutex is locked every time there is new
output available in the worker thread. partial_update is only used to
determine when to signal thr->coder->cond.

To reduce contention on the main mutex, worker_decoder() could lock it
only when
  - decoding of the Block has been finished (successfully or
unsuccessfully, that is, ret != LZMA_OK), or
  - there is new output available and partial_update is true; if
partial_update is false, thr->outbuf->pos is not touched.

This way only one worker will be frequently locking the main mutex.
However, I haven't tried it and thus don't know how much this affects
performance in practice. One possible problem might be that it may
introduce a small delay in output availability when the main thread
switches reading from the next outbuf in the list.


(5) Use of mythread_condtime_set()

In the encoder the absolute time is calculated once per lzma_code()
call. The comment in wait_for_work() in in stream_encoder_mt.c was
wrong. The reason the absolute time is calculated once per lzma_code()
call is to ensure that blocking multiple times won't make the timeout
ineffective if each blocking takes less than timeout milliseconds. So
it should be done similarly in the decoder.


(6) Use of lzma_outq_enable_partial_output()

It should be safe to call it unconditionally:

if (thr->outbuf == coder->outq.head)
lzma_outq_enable_partial_output(&coder->outq,
thr_do_partial_update);

If outq.head is something else, it is either already finished or
partial output has already been enabled. In both cases
lzma_outq_enable_partial_output() will do nothing.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [xz-devel] [PATCH v2] liblzma: Add multi-threaded decoder

2021-01-10 Thread Sebastian Andrzej Siewior
On 2021-01-09 22:21:23 [+0200], Lasse Collin wrote:
> Hello!
Hi,

> This sounds great but unfortunately I still haven't been able to
> properly read it yet. It may take a few more days. I apologize. :-(

no need to apologize.

> As a minor update from my side, I committed lzma_outq changes that I
> mostly did two weeks ago. I believe it should now be usable for
> threaded decompression too to decouple output buffers from threads.
> However, I don't mean that you must use it. If you wish to use it, it's
> OK to do the change later.

Let me look into that updated lzma_outq.

Sebastian



Re: [xz-devel] [PATCH v2] liblzma: Add multi-threaded decoder

2021-01-09 Thread Lasse Collin
Hello!

This sounds great but unfortunately I still haven't been able to
properly read it yet. It may take a few more days. I apologize. :-(

As a minor update from my side, I committed lzma_outq changes that I
mostly did two weeks ago. I believe it should now be usable for
threaded decompression too to decouple output buffers from threads.
However, I don't mean that you must use it. If you wish to use it, it's
OK to do the change later.

The docs are poor but:

  - When starting a Block and a new output buffer is needed, these must
be called in this order:

  * lzma_outq_has_buf(): if it fails, all buffers are in use
already.

  * lzma_outq_prealloc_buf(): ensures that a buffer of requested
size is available in the cache.

  * lzma_outq_get_buf(): gets a buffer from the cache.

  * lzma_outq_enable_partial_output() [MUTEX]: calls a callback to
tell the thread at the head of the queue to start making the
progress available to the main thread. Must be called with the
main mutex locked.

  - When reading decompressed output from the queue:

  * lzma_outq_is_readable() [MUTEX] can be used to poll if there is
output available.

  * lzma_outq_read() [MUTEX] is used for reading. It will return
LZMA_STREAM_END after the end of each buffer. In the
decompressor this is a sign that
lzma_outq_enable_partial_output() should be called before
trying to read more data.

  * lzma_outq_is_empty() can be used to detect when no more buffers
are pending and thus the end of the file may have been reached.

  - Worker threads:

  * lzma_outbuf.pos and .finished must be touched only with the
main mutex locked.

  * A simple call-back is needed for use with
lzma_outq_enable_partial_output().

On 2020-12-24 Sebastian Andrzej Siewior wrote:
> I moved parts of the memcpy() out the locked section. Only the thread,
> that is currently decompressing is waking the main thread. However the
> current output position is updated under the main-thread's mutex. So
> that might be not optimal.

I would expect it to be fine. When only one worker thread is updating
its status to the main thread, there won't be that much contention on
the mutex.

In the new lzma_outq, lzma_out_read() is called when the main mutex is
locked and so the copying from the intermediate buffer to the final
output buffer is done with the mutex locked so your code better here.
This isn't hard to change in lzma_outq but I didn't do it for this
commit to keep it less messy.

Anyway, I will read your patch carefully as soon as I'm able to focus
on it. Thanks a lot for your help!

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode