2022.06.05 07:03:08 (1533313449815511041) from Daniel J. Bernstein, replying to "0b0000000000000 (@0b0000000000000)" (1533300068660301824):
Sorting in L1 cache is the most important use case in post-quantum crypto and many other applications. It's also the base case inside vqsort, and something the vqsort paper and code put considerable effort into. The vqsort claim was "fastest", not just "fastest for large sizes".
2022.06.05 07:08:03 (1533314687726538753) from Daniel J. Bernstein:
So far I haven't been able to verify these vqsort speed claims. On the contrary, it seems that, for 32-bit data types on AVX2, vqsort would be faster if its base-case code were replaced by a call to the 2018 djbsort code. Similarly, vqsort should reuse vxsort-cpp for AVX-512.
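Note: the swap proposed above relies on the base case being a separable component of a recursive sort. The following is a minimal sketch of that structure only, not code from vqsort, djbsort, or vxsort-cpp. The names `hybrid_quicksort`, `insertion_sort`, `base_case` and the cutoff `BASE_CASE_THRESHOLD` are hypothetical, and the plain insertion sort is a stand-in for a tuned, vectorized small-array routine.

```python
from typing import Callable, List

# Hypothetical cutoff: ranges at or below this size go to the base case.
BASE_CASE_THRESHOLD = 16


def insertion_sort(a: List[int], lo: int, hi: int) -> None:
    """Sort a[lo:hi] in place. Stand-in for a tuned small-array sorter."""
    for i in range(lo + 1, hi):
        x = a[i]
        j = i - 1
        while j >= lo and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x


def hybrid_quicksort(
    a: List[int],
    lo: int = 0,
    hi: int = None,
    base_case: Callable[[List[int], int, int], None] = insertion_sort,
) -> None:
    """Quicksort on a[lo:hi] that hands small ranges to `base_case`."""
    if hi is None:
        hi = len(a)
    while hi - lo > BASE_CASE_THRESHOLD:
        # Median-of-three pivot value; it is always present in a[lo:hi].
        pivot = sorted((a[lo], a[(lo + hi) // 2], a[hi - 1]))[1]
        i, j = lo, hi - 1
        # Hoare-style partition around the pivot value.
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # Here a[lo:j+1] <= pivot <= a[i:hi]. Recurse into the smaller
        # side and loop on the larger one to bound the recursion depth.
        if (j + 1) - lo < hi - i:
            hybrid_quicksort(a, lo, j + 1, base_case)
            lo = i
        else:
            hybrid_quicksort(a, i, hi, base_case)
            hi = j + 1
    # Every remaining range ends up here, so this is the only place
    # that changes when the small-array routine is replaced.
    base_case(a, lo, hi)
```

Passing a different callable as `base_case` replaces the small-array routine without touching the partitioning code. Whether such a replacement is faster in practice can only be settled by benchmarking on the target CPU.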