Previously this required SSE 4.1, which is not guaranteed to be present on every x86-64 platform. The faster path is still compiled when SSE 4.1 is available, and there is now an alternate path that works on all x86-64 platforms.
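
To illustrate the idea of a compile-time fast path with a baseline fallback, here is a minimal sketch. It is not the code from this change: the function name `select_lt` is hypothetical, and using `_mm_blendv_ps` as the SSE 4.1 operation is an assumption made for the example. It relies on the compiler defining `__SSE4_1__` when SSE 4.1 is enabled (as GCC and Clang do with `-msse4.1`), and on SSE2 being part of the x86-64 baseline.

```cpp
#if defined(__SSE4_1__)
  #include <smmintrin.h>  // SSE 4.1 intrinsics (also pulls in earlier SSE headers)
#elif defined(__SSE2__) || defined(_M_X64)
  #include <emmintrin.h>  // SSE2 intrinsics, available on every x86-64 CPU
#endif

// Hypothetical helper: for each of 4 lanes, out[i] = (a[i] < b[i]) ? x[i] : y[i].
void select_lt(const float* a, const float* b,
               const float* x, const float* y, float* out) {
#if defined(__SSE4_1__)
    // Fast path: a single blend instruction does the per-lane select.
    __m128 mask = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    // blendv takes the second operand where the mask's sign bit is set.
    _mm_storeu_ps(out, _mm_blendv_ps(_mm_loadu_ps(y), _mm_loadu_ps(x), mask));
#elif defined(__SSE2__) || defined(_M_X64)
    // Alternate path: the same select built from and / andnot / or.
    __m128 mask = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    __m128 take_x = _mm_and_ps(mask, _mm_loadu_ps(x));     // mask & x
    __m128 take_y = _mm_andnot_ps(mask, _mm_loadu_ps(y));  // ~mask & y
    _mm_storeu_ps(out, _mm_or_ps(take_x, take_y));
#else
    // Non-x86 builds: plain scalar loop so the sketch still compiles.
    for (int i = 0; i < 4; ++i) {
        out[i] = (a[i] < b[i]) ? x[i] : y[i];
    }
#endif
}
```

All three branches produce the same result for comparison masks, so callers do not need to know which path was compiled.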
This implements the 4-wide API and moves the renderer over to it, but the underlying implementation is still scalar.
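
As a rough sketch of what a 4-wide API with a scalar implementation can look like: the type name `Float4` and the functions `v_min` and `h_sum` below are hypothetical and are not taken from this change. The point is that callers already operate on four lanes at once, so the per-lane scalar code can later be swapped for SIMD without touching them.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical 4-wide float type. The interface is 4-wide, but every
// operation is carried out one lane at a time with ordinary scalar code.
struct Float4 {
    float lanes[4];

    Float4(float a, float b, float c, float d) : lanes{a, b, c, d} {}
    explicit Float4(float s) : lanes{s, s, s, s} {}

    // Read a single lane (i must be in 0..3).
    float get(std::size_t i) const { return lanes[i]; }
};

// Per-lane addition.
inline Float4 operator+(const Float4& a, const Float4& b) {
    return Float4(a.get(0) + b.get(0), a.get(1) + b.get(1),
                  a.get(2) + b.get(2), a.get(3) + b.get(3));
}

// Per-lane multiplication.
inline Float4 operator*(const Float4& a, const Float4& b) {
    return Float4(a.get(0) * b.get(0), a.get(1) * b.get(1),
                  a.get(2) * b.get(2), a.get(3) * b.get(3));
}

// Per-lane minimum.
inline Float4 v_min(const Float4& a, const Float4& b) {
    return Float4(std::min(a.get(0), b.get(0)), std::min(a.get(1), b.get(1)),
                  std::min(a.get(2), b.get(2)), std::min(a.get(3), b.get(3)));
}

// Horizontal sum of all four lanes.
inline float h_sum(const Float4& a) {
    return (a.get(0) + a.get(1)) + (a.get(2) + a.get(3));
}
```

Calling code would then be written against the wide type, for example `Float4 r = v_min(a + b, Float4(1.0f));`, regardless of how the lanes are computed underneath.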