This contains both a pure Go and an amd64 assembly implementation of
operations over GF(2^255-19) using radix 2^51. The assembly version is
notably faster, but the representation doesn't help much in pure Go:
most of the possible gains are lost to the lack of a widening multiply for
64-bit integers.
Since we are always converting from affine, we know that Z1=1. This
formula is slightly faster and avoids converting through
CompletedGroupElement unnecessarily.
Assumptions: Z1=1.
Cost: 2M + 4S + 1*a + 7add + 1*2.
Source: 2008 Bernstein-Birkner-Joye-Lange-Peters,
https://eprint.iacr.org/2008/013,
plus Z1=1, plus standard simplification.
Explicit formulas:
B = (X1+Y1)^2
C = X1^2
D = Y1^2
E = a*C
F = E+D
X3 = (B-C-D)*(F-2)
Y3 = F*(E-D)
Z3 = F^2-2*F
https://hyperelliptic.org/EFD/g1p/auto-twisted-projective.html#doubling-mdbl-2008-bbjlp