# On the distribution of RSA primes

In a previous blog post we explained that the RSA cryptosystem relies on our inability to factorise large integers. Specifically, each RSA certificate has a large integer, $n$ (modulus), and it’s security relies on us not being able to factorise this modulus into its prime factors, $p$ (prime1) and $q$ (prime2).

In this blog post we will visualise the values of $p$ (prime1) versus $q$ (prime2) for a large number of certificates. This distribution of primes may be of use to those looking to formulate algorithms to factorise RSA moduli. As you will see, the distribution of primes in these certificates is perhaps somewhat different from what one might intuitively expect.

## Theoretical distribution of RSA primes

As you may already know, when you generate RSA certicates on Ubuntu 18.04 using ssh-keygen, it generates a modulus that has exactly 2048-bits by default. That is, the smallest possible modulus, $n$, is $2^{2047}$ (the smallest 2048-bit number) and the largest possible $n$ is $2^{2048} -1$ (the largest 2048-bit number). With this in mind, and the fact that $n = pq$, you would be forgiven for thinking that the distribution of primes looks something like this:

Note that in the plots in this blog post we represent RSA certificates by using the x and y axes to represent the two prime factors of the modulus in the certificate. The plot above is hypothetical and does not accurately show the distribution of real RSA certificates for a number of reasons – we will discuss this below.

## Actual distribution of RSA primes

There are a number of implicit assumptions we made in Figure 1 that turn out to be invalid. Firstly, we assumed that there is no restriction on the values of primes. However, in practice, as well as the modulus always being a 2048-bit integer, the primes factors in the keys that ssh-keygen generates are always 1024-bit integers themselves. This means both $p$ and $q$ are at least $2^{1023}$ (the smallest 1024-bit number) and at most $2^{1024}-1$ (the largest 1024-bit number).

Looking at Figure 1, clearly these lower and upper bounds on $p$ and $q$ themselves are not sufficient to guarantee $n$ is 2048 bits as some points will lie below the blue line. That is, while multiplying the largest possible 1024-bit values for $p$ and $q$ together guarantees a 2048-bit modulus, multiplying the smallest 1024-bit number together does not have the same guarantee (i.e. $(2^{1023})^2$ only has 2047 bits). From our experiments with real certificates we concluded that to guarantee the modulus is indeed 2048 bits ssh-keygen uses $0.75 \cdot{2^{1024}}$ as a lower bound for both $p$ and $q$.1 Note that a less strict lower bound of $\lceil\sqrt{2^{2047}}\rceil \approx 0.71 \cdot{2^{1024}}$ would also have sufficed.

Last but not least, primes in ssh-keygen certificates are always ordered.1 That is, $p$ (prime1) is always smaller than $q$ (prime2). Putting these constraints together leaves us with the following distribution of primes:

Unlike the previous plot (Figure 1) the plot above (Figure 2) was generated using 1,000 real certificates generated by ssh-keygen.

## Not all certificates are created equal

It can be clearly observed from Figure 2 that all certificates are located within a triangular bit of 2-dimensional space rather than a band-like shape as we originally thought in Figure 1. This makes for another interesting observation: some certificates may be significantly easier to factorise than others.

To see this, please consider Modulus A and Modulus B depicted in Figure 3 above as a purple square and a brown pentagon, respectively. Imagine not knowing the actual prime factors, just the moduli, and trying to factorise the moduli into their respective prime factors. Because you know that the primes need to adhere to $pq = n$ they must be on the respective dashed lines plotted in Figure 3. As you can see, these lines are of substantially different lengths for the two highlighted moduli. If we had a naive algorithm that would iterate over all possible values of $p$ and check if $p$ divides $n$ there would be significantly fewer possible values of $p$ to consider for Modulus B compared to Modulus A.

Don’t be mistaken, though – taking a significant part off a really large number still leaves a really large number. Even taking the certificate closest to the very tip of the triangle above leaves us with an integer that is impossible to factorise with currently known methods.

1. While these observations have been verified on 10,000 certificates and we are reasonably sure that these statements are generally true, we are unable to provide a guarantee as to the accuracy of these observations.  2