2021.01.01 09:03:59 (1344917270393327617) from Daniel J. Bernstein, replying to "Luke Champine (@lukechampine)" (1344705145788100612):
Signing with many precomputed points can easily spend 30% or more of the mults on Fermat inversion, and then the question is how well mults are implemented vs state-of-the-art asm. But note that if you're bottlenecked by signing then you can batch inversions (Montgomery's trick).
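[Editor's sketch, not part of the thread] The batching idea mentioned above (Montgomery's trick) replaces n separate inversions with one inversion plus roughly 3(n-1) multiplications. The following is a minimal illustration in Python over the field mod 2^255-19; the function name `batch_inverse` and the use of plain Python integers are assumptions made for illustration, not code from any ed25519 implementation, and it is not constant-time.

```python
P = 2**255 - 19  # field prime used by Curve25519 / ed25519


def batch_inverse(values, p=P):
    """Invert every element of `values` mod p using a single modular inversion."""
    n = len(values)
    if n == 0:
        return []

    # prefix[i] = values[0] * values[1] * ... * values[i] (mod p)
    prefix = [0] * n
    acc = 1
    for i, v in enumerate(values):
        if v % p == 0:
            raise ZeroDivisionError("cannot invert zero mod p")
        acc = acc * v % p
        prefix[i] = acc

    # The only inversion: Fermat inversion of the product of all inputs.
    inv_acc = pow(acc, p - 2, p)

    # Walk backwards, peeling off one factor at a time.
    out = [0] * n
    for i in range(n - 1, 0, -1):
        # inv_acc is the inverse of values[0] * ... * values[i]
        out[i] = inv_acc * prefix[i - 1] % p
        inv_acc = inv_acc * values[i] % p
    out[0] = inv_acc
    return out


if __name__ == "__main__":
    vals = [2, 3, 5, 12345678901234567890]
    invs = batch_inverse(vals)
    print(all(v * w % P == 1 for v, w in zip(vals, invs)))
```

The trade-off is latency for throughput: all n inputs must be available before any inverse comes out, which is why it only helps when signing is the bottleneck and signatures can be produced in groups.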
2020.12.31 13:24:52 (1344620536249212933) from Daniel J. Bernstein:
Not verified yet, so don't put into production, but seems to compute inverses mod 2^255-19 in under 4800 Skylake cycles: https://gcd.cr.yp.to/software.html Also speed records on Haswell, Broadwell, Kaby Lake, etc. Joint work with Bo-Yin Yang. Uses convex-hull calculations from @pwuille.
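[Editor's sketch, not part of the thread] For readers unfamiliar with the two approaches to inversion mod 2^255-19, here is a small Python comparison of Fermat inversion (exponentiation by p-2) with a gcd-based inverse. The gcd version below is the textbook extended Euclidean algorithm, chosen only for brevity. It is not the algorithm from the linked software, and it is variable-time, so it is unsuitable for secret inputs. The names `inv_fermat` and `inv_euclid` are hypothetical.

```python
P = 2**255 - 19  # field prime used by Curve25519 / ed25519


def inv_fermat(x, p=P):
    """Inverse via Fermat's little theorem: x^(p-2) mod p, for prime p."""
    if x % p == 0:
        raise ZeroDivisionError("cannot invert zero mod p")
    return pow(x, p - 2, p)


def inv_euclid(x, p=P):
    """Inverse via the textbook extended Euclidean algorithm (variable-time)."""
    # Invariant: t0 * x == r0 (mod p) and t1 * x == r1 (mod p).
    r0, r1 = p, x % p
    t0, t1 = 0, 1
    while r1 != 0:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        t0, t1 = t1, t0 - q * t1
    if r0 != 1:
        raise ZeroDivisionError("input is not invertible mod p")
    return t0 % p


if __name__ == "__main__":
    x = 0x1234567890ABCDEF
    print(inv_fermat(x) == inv_euclid(x))
    print(x * inv_euclid(x) % P == 1)
```

Both functions return the same field element. They differ in cost: Fermat inversion spends its time in a long chain of field squarings and multiplications, whereas gcd-based methods work on the integers directly.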
2020.12.31 19:01:04 (1344705145788100612) from "Luke Champine (@lukechampine)":
What does this translate to in terms of ed25519 performance? Based on a quick test, I estimate that the fastest available Go implementation could sign ~23% faster with this asm, but I don't know how that compares to other implementations.