On 2020-12-14 Sebastian Andrzej Siewior wrote:
> On 2020-12-13 23:19:25 [+0200], Lasse Collin wrote:
> > Yes, reusing buffers and encoder/decoder states can be useful (fewer
> > page faults). Perhaps even the input buffer could be reused if it
> > is OK to waste some memory and it makes a difference in speed.  
> 
> I tried two archives with 16 & 3 CPUs and the time remained the same.
> I tried to only increase the in-buffer and to allocate the block-size
> also for the in-buffer. No change.

OK, better keep the code simple then.

> I tried to decouple the thread and the out-buffer but after several
> failures I tried to keep it simple for the start.
> I do have idle threads after a while with 16 CPUs. The alternative is
> to keep them busy with more memory.  With 4 CPUs I get to
> |  100 %         10,2 GiB / 40,0 GiB = 0,255   396 MiB/s       1:43
>           
> and this is of course CPU bound due to the `sha1' part as consumer.

Don't get me wrong, the performance is already very good. :-)

> I don't know how reasonable this performance is if it means that you
> have to write 400MiB/s to disk. Of course, should it become an issue
> then it can still be decoupled.

There are SSDs that are much faster and in some use cases the data
isn't written to a disk at all. That said, I don't know how much it
matters in practice. A better threading implementation could be
(perhaps significantly) faster, but it's still diminishing returns.

I looked a bit at making lzma_outq usable for both encoding and
decoding. Basically it means changing the fixed buffer allocation to
dynamic allocation and caching the most recently used buffers of
identical size. It doesn't seem hard, although a few details may
complicate it.
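To illustrate the caching idea, here is a minimal sketch of reusing the
most recently freed buffer when the requested size matches, and falling
back to a fresh allocation otherwise. The names (buf_cache, cache_get,
cache_put) are illustrative only, not actual liblzma identifiers:

```c
#include <stdlib.h>

/* Hypothetical single-slot buffer cache: keep only the most recently
 * returned buffer and reuse it when the next request has the same
 * size. Not the actual lzma_outq implementation. */
struct buf_cache {
	void *buf;    /* cached buffer, or NULL if the cache is empty */
	size_t size;  /* size of the cached buffer */
};

static void *cache_get(struct buf_cache *c, size_t size)
{
	if (c->buf != NULL && c->size == size) {
		/* Cache hit: hand out the cached buffer. */
		void *buf = c->buf;
		c->buf = NULL;
		return buf;
	}

	/* Wrong size or empty cache: drop the old buffer (free(NULL)
	 * is a no-op) and allocate a new one. */
	free(c->buf);
	c->buf = NULL;
	return malloc(size);
}

static void cache_put(struct buf_cache *c, void *buf, size_t size)
{
	/* Keep only the most recent buffer. */
	free(c->buf);
	c->buf = buf;
	c->size = size;
}
```

With mostly identical block sizes, as in typical .xz streams, nearly
every allocation after the first becomes a cache hit.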

One thing is passing data from the workers to the main thread.
lzma_outq relies on coder->mutex to protect lzma_outbuf.finished. This
is fine in the encoder where the whole block must be finished before
any of it can be copied out.

When decoding one can get smoother output by copying decompressed data
out in smaller chunks. Your code does this but it's still a single
mutex for all threads. With many threads that is easily thousands or
even tens of thousands of locks/unlocks per second from all threads
combined. I don't know how much it matters or whether it is worth the
extra complexity. For example, one could have a thread-specific flag
indicating if the main thread is interested in the data from that
thread. Then only that thread would lock/signal/unlock the main mutex
when a chunk of data is ready.
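A sketch of that flag scheme, using pthreads: the worker checks its own
flag under a per-worker mutex and touches the shared main mutex only
when the main thread has asked for data. All names here are
illustrative, not the actual xz code:

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-worker state for the "main thread is interested"
 * flag described above. */
struct worker {
	pthread_mutex_t mutex;       /* protects partial_update_wanted */
	bool partial_update_wanted;

	/* Shared with the main thread: */
	pthread_mutex_t *main_mutex;
	pthread_cond_t *main_cond;
};

/* Called by a worker after it has decompressed a chunk of output. */
static void worker_chunk_ready(struct worker *w)
{
	pthread_mutex_lock(&w->mutex);
	bool wanted = w->partial_update_wanted;
	w->partial_update_wanted = false;
	pthread_mutex_unlock(&w->mutex);

	/* Only when the main thread asked for data is the shared
	 * mutex locked; otherwise no contention on it at all. */
	if (wanted) {
		pthread_mutex_lock(w->main_mutex);
		pthread_cond_signal(w->main_cond);
		pthread_mutex_unlock(w->main_mutex);
	}
}

/* Called by the main thread when it wants partial output from this
 * particular worker. */
static void main_request_partial(struct worker *w)
{
	pthread_mutex_lock(&w->mutex);
	w->partial_update_wanted = true;
	pthread_mutex_unlock(&w->mutex);
}
```

This way only one worker at a time signals on the main mutex instead of
every worker locking it for every chunk.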

If the output buffering is decoupled using lzma_outq, the main thread
won't directly know in which thread it should set the flag. Instead,
lzma_outbuf needs a pointer to the associated worker (or NULL if the
buffer is finished).
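As a rough picture of that layout, each output buffer could carry a
back-pointer to the worker still filling it. The field names below are
only a guess at the shape, not the real lzma_outbuf definition:

```c
#include <stddef.h>
#include <stdbool.h>
#include <stdint.h>

struct worker_thread;  /* opaque worker state */

/* Hypothetical sketch of an output queue entry that knows which
 * worker (if any) is still producing into it. */
struct outbuf_sketch {
	uint8_t *buf;      /* dynamically allocated output buffer */
	size_t allocated;  /* size of buf */
	size_t pos;        /* amount of data produced so far */
	bool finished;     /* true once the whole block is decoded */

	/* Worker still writing to this buffer, or NULL when the
	 * buffer is finished; the main thread follows this pointer
	 * to set the "partial update wanted" flag. */
	struct worker_thread *worker;
};
```

The main thread then walks the queue head, and if the oldest buffer
isn't finished, it sets the flag via outbuf->worker.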

I already got started with this lzma_outq modification but I'm not sure
yet if or when I get it finished.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode
