On 05/14/14 23:56, Daniel Verkamp wrote: > Here's a quick cleanup of md5sum. Executive summary: smaller and faster. > > On my machine, for a 2.2 GB file of random bytes, the timings with > warm cache are: > toybox before: 11.4 seconds > toybox after: 8.3 seconds > GNU md5sum: 3.9 seconds > openssl dgst -md5: 3.5 seconds > > This is clearly better than before (3x openssl), but still slow (2x openssl). > > I suspect there is more low-hanging fruit to be had by eliminating the > memcpy in hash_update (maybe not too much - hash_update accounts for > about 4% of total runtime versus 92% for md5_transform according to > perf - but this would also help sha1sum). > > make bloatcheck on x86_64 gcc 4.8.2 -Os: > name old new delta > ----------------------------------------------------------------------- > md5rot 0 64 64 > md5_transform 365 223 -142 > ----------------------------------------------------------------------- > -78 total
Indeed. > Rationale for the changes: > > Move definition of 'rol' up so it can be used in md5_transform. This > is purely cosmetic; it expands to exactly the same code. > > Put rotation counts in a lookup table instead of calculating them on > the fly. This is mostly a wash size-wise, +5 bytes total, but > worthwhile for readability and speed. I tried removing the repetition from the table and indexing it by (i&3)+((i>>2)&~3) and an md5sum of a dvd image I had lying around when from 120 seconds to 132 seconds. :P Sigh. I guess it's a form of loop unrolling. I wasn't trying for the fastest implementation out there (I've seen ones with inline assembly), I was going for "readable". I guess if your idea of readable includes that much repetition... > Instead of accessing the state array using a rotating index (the > variable formerly known as 'a'), access the state with constant > offsets and rotate the contents of the array instead. This is the big > win - it eliminates all the crazy memory addressing math inside the > loop. Ah yes, crazy that I didn't hide the math in a macro to generate the index. I guess x86-64 can keep it all in registers so the extra assignments are cheaper than the extra math operations that avoid the need to move the data. *shrug* Isn't always obvious to me what "simple" is for something like this. Not gonna argue with actual measurements... Rob _______________________________________________ Toybox mailing list [email protected] http://lists.landley.net/listinfo.cgi/toybox-landley.net
