With just one round, this hash is better than the previous one with
two rounds. At 2-3 rounds its quality appears to match that of the
slow, per-bit hashing approach that I've been using as ground truth
for testing.

Previously this required SSE 4.1, which is not guaranteed to be
present on all x86-64 platforms. The faster path is still compiled
when SSE 4.1 is available, but there is now a fallback path that
works on all x86-64 platforms.
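
As a rough illustration of this kind of compile-time path selection,
here is a minimal sketch in C. It is not the code from this change.
It assumes, purely for the example, that the operation needing SSE 4.1
is a 32-bit lane-wise multiply (_mm_mullo_epi32), which may not be
what the hash actually uses. The function name mul_lanes_u32 is
hypothetical. The fallback uses only SSE2, which is part of the x86-64
baseline, and a scalar path is included so the sketch also builds on
other architectures.

```c
#include <stdint.h>

#if defined(__SSE4_1__)
#include <smmintrin.h> /* SSE 4.1 intrinsics */
#elif defined(__SSE2__) || defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics, always available on x86-64 */
#endif

/* Hypothetical helper: multiply four 32-bit lanes, keeping the low
 * 32 bits of each product. Assumed stand-in for whatever operation
 * the real hash needs SSE 4.1 for. */
static void mul_lanes_u32(const uint32_t a[4], const uint32_t b[4],
                          uint32_t out[4])
{
#if defined(__SSE4_1__)
    /* Faster path: one instruction does the lane-wise multiply. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mullo_epi32(va, vb));
#elif defined(__SSE2__) || defined(__x86_64__) || defined(_M_X64)
    /* Fallback path: build the same result from two 32x32->64 multiplies. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* 64-bit products of lanes 0 and 2. */
    __m128i even = _mm_mul_epu32(va, vb);
    /* Shift lanes 1 and 3 into even positions, then multiply them. */
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(va, 4),
                                _mm_srli_si128(vb, 4));
    /* Gather the low 32 bits of each product into the bottom two lanes. */
    __m128i even_lo = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    __m128i odd_lo = _mm_shuffle_epi32(odd, _MM_SHUFFLE(0, 0, 2, 0));
    /* Interleave back into lane order 0, 1, 2, 3. */
    _mm_storeu_si128((__m128i *)out, _mm_unpacklo_epi32(even_lo, odd_lo));
#else
    /* Scalar path for non-x86 targets. */
    for (int i = 0; i < 4; i++) {
        out[i] = a[i] * b[i];
    }
#endif
}
```

Building with -msse4.1 (GCC or Clang) selects the first branch; a
plain x86-64 build selects the second. Both branches should produce
identical results, so callers do not need to know which one was
compiled in.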