Here are the things I looked at to inform this article:
- Number Theoretic Transform by Thomas Ammer and Katharina Kreuzer
  - An Isabelle formalization of the Fast Number Theoretic Transform and Inverse Fast Number Theoretic Transform algorithms
- Fast Fourier Transform and Convolution Algorithms by H.J. Nussbaumer
  - Chapter 8 has a section on Number Theoretic Transforms
- A Complete Beginner Guide to the Number Theoretic Transform by Satriawan, Mareta, and Lee
  - A beginner’s guide to understanding most of the pieces of NTTs
- cryptography101.ca by Dr. Alfred Menezes
General themes:
- Moving values in one domain to another domain
- Speeding up computation
Before I start with the same example that Satriawan, Mareta, and Lee use, I should first start with the implicit goal that I have in mind.
Classical asymmetric cryptography is broken. Peter Shor showed in 1994 that quantum computers could solve both the discrete logarithm problem and the integer factoring problem with a superpolynomial speedup over the best known classical algorithms. Since classical asymmetric cryptography depends on the hardness of these problems, we can no longer really depend on our existing solutions.
Since then, we’ve spent lots of time brainstorming good future solutions for signature algorithms and asymmetric cryptography.
The one that I’m interested in is called ML-KEM (standardized as FIPS 203 and known historically as Kyber).
ML-KEM (Module-Lattice-Based Key-Encapsulation Mechanism) bases its security on the module learning with errors problem.
These words aren’t that important to know right now, but what we do need to consider is that ML-KEM relies heavily on multiplying polynomials (or at least their computer representation).
Let’s take a brief intro into multiplying polynomials.
```python
from numpy.polynomial import Polynomial

# 4.0 + 1.0·x + 2.0·x² + 3.0·x³
f = Polynomial([4, 1, 2, 3])

# 3.0 + 3.0·x + 0.0·x² + 1.0·x³
g = Polynomial([3, 3, 0, 1])

# 12.0 + 15.0·x + 9.0·x² + 19.0·x³ + 10.0·x⁴ + 2.0·x⁵ + 3.0·x⁶
h = f * g
```
Discrete convolution can be expressed as:
$$ y[n] = \sum_{l = -\infty}^{+\infty} h[l] x[n-l] $$ We assume that $h$ and $x$ are non-zero only at indices $0$ to $N-1$ and $0$ to $M-1$ respectively, so their lengths are $N$ and $M$. The highest index at which the sum can be non-zero is $(N - 1) + (M - 1)$, so the output has length $N - 1 + M - 1 + 1 = N + M - 1$.
Let’s now implement this discrete convolution in Python.
```python
def convolve(f, g):
    f_len, g_len = len(f), len(g)
    if f_len != g_len:
        raise ValueError("g and f must have the same lengths")
    h_len = f_len + g_len - 1
    h = [0] * h_len
    for n in range(h_len):
        acc = 0
        for L in range(f_len):
            if 0 <= (n - L) < g_len:
                acc += f[L] * g[n - L]
        h[n] = acc
    return h

f = [4, 1, 2, 3, 3]
g = [3, 3, 0, 1, 9]
print(convolve(f, g))
```
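As a sanity check (assuming NumPy is available in your environment), `numpy.convolve` computes the same discrete convolution sum, so it should print the same list as the `convolve` above:

```python
import numpy as np

f = [4, 1, 2, 3, 3]
g = [3, 3, 0, 1, 9]

# numpy.convolve implements the same discrete convolution sum
print(np.convolve(f, g).tolist())  # [12, 15, 9, 19, 55, 20, 21, 30, 27]
```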
To analyze the time complexity, let’s look at the two loops used. We established `h_len` to be `f_len + g_len - 1`, so the inner loop body runs a total of `f_len * (f_len + g_len - 1)` times. We also stipulate that `f_len == g_len`, so this is equivalent to `2*f_len^2 - f_len`. The dominant term here is `f_len^2`, so the time complexity is $O(n^2)$.
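To make that count concrete, here is a small sketch (my own illustrative code, not taken from the references above) that mirrors the loop structure of `convolve` and counts the work for two length-`n` inputs; the bounds check runs `2n^2 - n` times and the multiply-accumulate fires exactly `n^2` times, both of which are $O(n^2)$:

```python
def inner_loop_counts(n):
    # Mirror the loop structure of convolve for two length-n inputs
    guard_checks, mults = 0, 0
    for out in range(2 * n - 1):   # h_len = 2n - 1 output indices
        for L in range(n):         # f_len = n inner iterations
            guard_checks += 1
            if 0 <= out - L < n:   # same bounds check as convolve
                mults += 1
    return guard_checks, mults

for n in [4, 8, 16]:
    print(n, inner_loop_counts(n))  # counts are (2n^2 - n, n^2)
```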
We should also note that the convolution present in ML-KEM is a cyclic convolution.
Essentially it is the same convolution process from before, but the result is then reduced modulo some $x^n + 1$ (in the case of negatively wrapped convolution).
For the set of polynomials of degree less than or equal to $n$ (denoted $P_n$), the dimension of that vector space is $n + 1$. We see this is the case from the length of the canonical basis for this vector space: $$ \{1, x, x^2, \cdots, x^n\} $$ We can express the formula for negative wrapped convolution as: $$ NWC(x) = Y(x) \bmod (x^n + 1) $$ where $Y(x)$ is the result of the linear convolution of the two polynomials.
We also make the interesting observation that the negative wrapped convolution has length $n$ instead of $2n - 1$ (since we expect both inputs to have the same length).
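Here is a minimal sketch of that reduction (the helper name `negative_wrapped_convolution` is my own, not from the references): since $x^n \equiv -1 \pmod{x^n + 1}$, each coefficient of $x^{n+k}$ in the linear convolution folds back onto $x^k$ with a sign flip, leaving a length-$n$ result:

```python
def negative_wrapped_convolution(f, g):
    # Linear convolution of two equal-length coefficient lists
    n = len(f)
    lin = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            lin[i + j] += f[i] * g[j]
    # Reduce modulo x^n + 1: the coefficient of x^(n+k) folds to -x^k
    out = lin[:n]
    for k in range(n - 1):
        out[k] -= lin[n + k]
    return out

# Same f and g as the polynomial example; output has length n = 4
print(negative_wrapped_convolution([4, 1, 2, 3], [3, 3, 0, 1]))  # [2, 13, 6, 19]
```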
Notice that before, the operation $\times$ on $P_n$ was not actually closed: when we picked two polynomials of a given degree, the product was not in $P_n$.
Being closed on this set is useful, since it makes the algorithm feasible to write and also bounds the amount of memory we should expect to allocate.