[1] S. Gueron, V. Krasnov: Parallelizing message schedules to accelerate the
computations of hash functions, http://eprint.iacr.org/2012/067.pdf
...
As for Haswell: as discussed, it is capable of executing 8xSMS SHA256 and
4xSMS SHA512, i.e. loading 8/4x input blocks
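
To illustrate what loading several input blocks at once can look like, here is a
minimal sketch of the SHA256 message schedule computed for 8 blocks side by side.
It assumes that "SMS" refers to the simultaneous message scheduling of [1], and
it models the vector lanes with plain Python lists rather than real SIMD
registers. The names schedule_scalar and schedule_lanes are hypothetical and
exist only for this illustration; they are not taken from any implementation
discussed here.

```python
MASK = 0xFFFFFFFF  # SHA256 works on 32-bit words


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def sigma0(x):
    """Small sigma0 from the SHA256 message schedule."""
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def sigma1(x):
    """Small sigma1 from the SHA256 message schedule."""
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def schedule_scalar(block):
    """Expand one 64-byte block into the schedule words W[0..63]."""
    w = [int.from_bytes(block[4 * i:4 * i + 4], "big") for i in range(16)]
    for t in range(16, 64):
        w.append((sigma1(w[t - 2]) + w[t - 7] + sigma0(w[t - 15]) + w[t - 16]) & MASK)
    return w


def schedule_lanes(blocks):
    """Expand several blocks at once (hypothetical helper).

    Row t of the result holds W[t] for every block, one lane per block.
    With 8 blocks a row corresponds to eight 32-bit lanes, i.e. the width
    of one 256-bit vector register.
    """
    lanes = range(len(blocks))
    w = [[int.from_bytes(b[4 * i:4 * i + 4], "big") for b in blocks] for i in range(16)]
    for t in range(16, 64):
        w.append([
            (sigma1(w[t - 2][k]) + w[t - 7][k] + sigma0(w[t - 15][k]) + w[t - 16][k]) & MASK
            for k in lanes
        ])
    return w


# Eight consecutive 64-byte blocks of one 512-byte message (arbitrary test data).
message = bytes(range(256)) * 2
blocks = [message[64 * k:64 * (k + 1)] for k in range(8)]
rows = schedule_lanes(blocks)
```

Only the message schedule is independent from block to block, so only that part
is expanded in parallel here; the compression rounds still consume the blocks
one after another. For SHA512 the same layout would use 64-bit words, which is
why the same register width gives 4 lanes instead of 8.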
The AVX1 implementation:
The speedup (measured on Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz)