XZ Utils 5.1.4beta includes an optimized buffer-comparison routine that improves encoding speed. It works on systems that support unaligned memory access. The relevant code is in src/liblzma/common/memcmplen.h:
http://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h

Different architectures get the best performance from different code. The current code should be decent on x86-64 and perhaps also on 32-bit x86 (at least the SSE2 version). Both may still have some room for improvement, and help with them is welcome. However, no one has looked at how the code could be improved for non-x86 architectures, so I'm especially interested in finding people to help with that.

I have heard that the generic versions work on little endian 32-bit ARM and on 32-bit big endian PowerPC. On those, the generic code is slightly faster than the byte-by-byte buffer comparison, but perhaps arch-specific code could do better. The method used on x86-64 could be good for other 64-bit CPUs too if __builtin_ctzll maps to a fast instruction.

Timing "xz -e" when compressing a fairly compressible file (many source code tarballs are such) is a good way to test different lzma_memcmplen implementations. The reason for using -e is that the relative improvement tends to be bigger when that option is used. On x86-64 I've seen compression as much as 25 % faster with some files compared to the byte-by-byte method.

-- 
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
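To illustrate the idea behind the x86-64 method mentioned above, here is a rough sketch of a word-at-a-time match-length routine using __builtin_ctzll. This is only an illustration, not the actual lzma_memcmplen from memcmplen.h (the function name and exact signature here are made up); it assumes a little endian CPU and uses memcpy so the compiler can emit an unaligned load where that is safe:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: return the length of the common prefix of buf1 and buf2,
 * starting the comparison at offset len and never going past limit.
 * Compares eight bytes per iteration; the first differing byte is found
 * with __builtin_ctzll, which on little endian points at the lowest
 * (i.e. first) mismatching byte of the XOR of the two words. */
static size_t
memcmplen_sketch(const uint8_t *buf1, const uint8_t *buf2,
                 size_t len, size_t limit)
{
    while (len + 8 <= limit) {
        uint64_t a, b;
        memcpy(&a, buf1 + len, 8);
        memcpy(&b, buf2 + len, 8);

        const uint64_t x = a ^ b;
        if (x != 0) {
            /* __builtin_ctzll is undefined for zero, so it is
             * only called when a mismatch is known to exist. */
            return len + ((size_t)__builtin_ctzll(x) / 8);
        }

        len += 8;
    }

    /* Byte-by-byte fallback for the last few bytes. */
    while (len < limit && buf1[len] == buf2[len])
        ++len;

    return len;
}
```

The real code can be more aggressive, for example by allowing reads slightly past limit when the buffers are known to have extra space; the sketch above stays strictly within the given bounds.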