XZ Utils 5.1.4beta includes an optimized buffer-comparison routine that improves encoding speed. It works on systems that support unaligned memory access. The relevant code is in src/liblzma/common/memcmplen.h:
http://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h

Different architectures get the best performance from different code. The current code should be decent on x86-64 and perhaps also on 32-bit x86 (at least the SSE2 version). Both may still have some room for improvement, and help with them is welcome. However, no one has looked at how the code could be improved for non-x86 architectures, so I'm especially interested in finding people to help with that.

I have heard that the generic versions work on little endian 32-bit ARM and on 32-bit big endian PowerPC. On those, the generic code is slightly faster than the byte-by-byte buffer comparison, but perhaps arch-specific code could do better. The method used on x86-64 could be good for other 64-bit CPUs too if __builtin_ctzll maps to a fast instruction.

Timing "xz -e" when compressing a fairly compressible file (many source code tarballs are such) is a good way to test different lzma_memcmplen implementations. The reason for using -e is that the relative improvement tends to be bigger when that option is used. On x86-64 I've seen compression as much as 25 % faster with some files compared to the byte-by-byte method.

-- 
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
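To illustrate the idea behind the x86-64 method mentioned above, here is a rough sketch of a word-at-a-time match-length routine using __builtin_ctzll. This is only an illustration, not the actual lzma_memcmplen from memcmplen.h (the function name and exact signature here are made up); it assumes a little endian CPU and uses memcpy so the compiler can emit an unaligned load where that is safe:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: return the length of the common prefix of buf1 and buf2,
 * starting the comparison at offset len and never going past limit.
 * Compares eight bytes per iteration; the first differing byte is found
 * with __builtin_ctzll, which on little endian points at the lowest
 * (i.e. first) mismatching byte of the XOR of the two words. */
static size_t
memcmplen_sketch(const uint8_t *buf1, const uint8_t *buf2,
                 size_t len, size_t limit)
{
    while (len + 8 <= limit) {
        uint64_t a, b;
        memcpy(&a, buf1 + len, 8);
        memcpy(&b, buf2 + len, 8);

        const uint64_t x = a ^ b;
        if (x != 0) {
            /* __builtin_ctzll is undefined for zero, so it is
             * only called when a mismatch is known to exist. */
            return len + ((size_t)__builtin_ctzll(x) / 8);
        }

        len += 8;
    }

    /* Byte-by-byte fallback for the last few bytes. */
    while (len < limit && buf1[len] == buf2[len])
        ++len;

    return len;
}
```

The real code can be more aggressive, for example by allowing reads slightly past limit when the buffers are known to have extra space; the sketch above stays strictly within the given bounds.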