2022.06.05 22:07:25 (1533541018687066112) from Daniel J. Bernstein, replying to "Jacob Christian Munch-Andersen (@NoHatCoder)" (1533520109859479554):
Um, no, that's not how Intel CPUs work. Intel prioritizes speed, and then tries to reduce power without noticeable slowdowns. Agner Fog's example is AVX2 ramping up to full power in 56000 cycles and staying there unless there's _no_ 256-bit instruction for _millions_ of cycles.
2022.06.05 18:15:39 (1533482694482403328) from Daniel J. Bernstein, replying to "Jacob Christian Munch-Andersen (@NoHatCoder)" (1533358075314323456):
Ran a loop of 33 rdtsc+vqsort, each >8000 cycles for the smaller size that I mentioned. One always expects initial calls to be outliers (not just for AVX2 ramp-up; the big starting issue is code caching); djbsort's int32-speed (https://sorting.cr.yp.to/speed.html) reports medians and quartiles.
2022.06.05 18:19:57 (1533483773773221888) from Daniel J. Bernstein:
AVX2 usage has also become so pervasive in typical code that it's not surprising for the CPU to always have the AVX2 unit warmed up; cooldown is triggered after millions of non-AVX2 cycles. But the more important point is to always check for variations across many measurements.
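[Editor's sketch of the "check for variations across many measurements" point above. It is not the benchmark discussed in the thread: that one timed vqsort with rdtsc, while this sketch assumes a wall-clock timer (time.perf_counter_ns) and uses Python's built-in sorted as a stand-in workload. The names measure and summarize are hypothetical. It keeps the first call separate, since initial calls are expected to be outliers, and reports the median and quartiles.]

```python
import statistics
import time


def measure(fn, runs=33):
    """Call fn() `runs` times; return each call's duration in nanoseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return samples


def summarize(samples):
    """Report the first (likely cold) call plus quartiles over all calls."""
    q1, median, q3 = statistics.quantiles(samples, n=4)
    return {"first": samples[0], "q1": q1, "median": median, "q3": q3}


if __name__ == "__main__":
    # Stand-in workload (an assumption): sort a reversed list of 1000 ints.
    data = list(range(1000, 0, -1))
    print(summarize(measure(lambda: sorted(data))))
```

[A wide gap between q1 and q3, or a first call far from the median, is a sign that a single number would misrepresent the run.]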
2022.06.05 20:43:46 (1533519967488024577) from "Jacob Christian Munch-Andersen (@NoHatCoder)":
So 33*8000 cycles, that is a tiny benchmark. I'm not sure why one algorithm would hit a consistent hiccup and the other wouldn't, but stranger things have happened. As for AVX2, the latest Steam hardware survey says 88% adoption. Most modern code is single-path 128-bit.
2022.06.05 20:44:20 (1533520109859479554) from "Jacob Christian Munch-Andersen (@NoHatCoder)", replying to "Jacob Christian Munch-Andersen (@NoHatCoder)" (1533519967488024577):
You don't want a stray library or system call to run 256-bit for a few thousand cycles, as the mode switch is not worth it. As for your benchmark, I guess the core would most likely be switched on from idle, so the 256-bit path would be off, I assume.