The goal of this document is to present the results of *Quantum
Fourier Transform (QFT)* circuit simulation benchmarking. The main
focus is investigation of the relation between the number of input
qubits and the time required to complete the corresponding simulation.
The influence of other parameters, such as the number of computing nodes
used, is also examined.

The QFT simulations were set up using two different frameworks – QuEST and Qiskit. *QuEST*
(Quantum Exact Simulation Toolkit) is a high performance simulator
written in *C* and *C++*. It is well-suited for HPC, as it
supports multithreading (*OpenMP*) and distributed multitasking
via message passing (*MPI*). *Qiskit*, on the other hand,
is a *Python* library, with core functions implemented in
*C++*. It supports multithreading, and it is possible to leverage
it to use parallel *MPI* tasks. This is done via cache blocking^{1}.

The simulations were run on *ARCHER2* computing cluster. The
number of threads was fixed throughout simulations to be equal to the
number of cores per node – 128. This way, each node was fully utilised^{2} without
the need to pass any local messages, which would negatively impact the
performance. On the other hand, the number of nodes used was varied
between different powers of 2 (from 1 to 256)^{3}, to examine the
scaling speedup. In *Qiskit* implementation, the number of
blocking qubits had to follow the formula: \(N_{block}=N_{qub}-\log_2{N_{proc}}\), and
due to memory limitations could not be larger than 26, as otherwise the
maximum size of passed messages would exceed 32-bit integers^{4}.

The circuit corresponding to QFT for \(n\) qubits looks as follows:

It is a non-trivial circuit, which makes it particularly suitable for simulations – it consists of \(n\) Hadamard gates, \(\frac{1}{2}n(n-1)\) \(UROT_k\) gates, and \(\lfloor n/2 \rfloor\) swap gates. \(UROT_k\) represents a controlled phase shift by \(\frac{2\pi}{2^k}\). The overall action of the circuit on vector \(\lvert x \rangle\) can be written as: \[ QFT \lvert x \rangle = \frac{1}{\sqrt{2^n}}\bigotimes_{k=1}^n \left( \lvert 0 \rangle + \exp{\frac{2\pi ix}{2^k} \lvert 1 \rangle} \right) \] If we take \(x = 0\), the action becomes trivial: \[ QFT \lvert 0 \rangle = \frac{1}{\sqrt{2^n}}\bigotimes_{k=1}^n \left( \lvert 0 \rangle + \lvert 1 \rangle \right) = \frac{1}{\sqrt{2^n}}\sum_{i=0}^{2^n-1} \lvert i \rangle \] I.e. every state has an equal weight. Such a result is very easy to verify, validating that the written circuit is working correctly.

The results were written by a SLURM script to an output file in a CSV
format with three columns – *Nqubits*, *Ntasks* and
*Time* (in *ms*). In order to save on credits, most jobs
were submitted for a fixed number of nodes, as otherwise some of the
reserved resources would stay idle. The outputs of each series of
computations were combined by hand^{5}.

The final raw data transformed into more readable form are presented
below. \(N=2^k\) columns represent the
number of nodes, while floating point values represent timings in
seconds. Some timings are not available due to memory limits, and these
are noted as *NA*.

*QuEST* timings:

```
## # A tibble: 13 × 10
## Nqubits `N=1` `N=2` `N=4` `N=8` `N=16` `N=32` `N=64` `N=128` `N=256`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 28 6.46 4.42 2.59 0.941 0.587 0.333 0.215 0.133 0.131
## 2 29 13.4 9.05 5.52 3.04 1.14 0.662 0.326 0.210 0.104
## 3 30 28.0 18.7 11.2 6.46 3.69 1.33 0.782 0.381 0.254
## 4 31 58.6 38.7 22.8 13.3 7.70 4.19 1.52 0.896 0.468
## 5 32 125. 80.9 47.4 27.1 15.6 8.81 4.70 1.75 1.02
## 6 33 264. 169. 97.7 55.7 30.9 17.5 9.71 5.18 2.04
## 7 34 NA NA 201. 114. 63.5 35.4 19.6 10.9 5.67
## 8 35 NA NA NA 235. 129. 72.0 39.4 21.9 12.1
## 9 36 NA NA NA NA 262. 147. 79.3 44.2 24.4
## 10 37 NA NA NA NA NA 299. 160. 88.9 48.3
## 11 38 NA NA NA NA NA NA 322. 179. 96.0
## 12 39 NA NA NA NA NA NA NA 358. 193.
## 13 40 NA NA NA NA NA NA NA NA 387.
```

*Qiskit* timings:

```
## # A tibble: 11 × 8
## Nqubits `N=1` `N=2` `N=4` `N=8` `N=16` `N=32` `N=64`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 28 11.2 9.52 4.85 2.76 1.71 1.32 1.21
## 2 29 23.5 20.0 9.89 5.14 2.98 1.75 1.47
## 3 30 49.8 43.6 21.6 11.1 5.83 3.21 2.31
## 4 31 105. 92.2 45.4 23.2 12.1 6.27 4.18
## 5 32 220. 194. 96.2 50.1 25.8 13.3 7.89
## 6 33 481. 391. 190. 95.9 48.7 25.3 14.5
## 7 34 NA NA 412. 208. 107. 54.7 30.4
## 8 35 NA NA NA 442. 222. 112. 61.4
## 9 36 NA NA NA NA 476. 242. 130.
## 10 37 NA NA NA NA NA 483. 255.
## 11 38 NA NA NA NA NA NA 552.
```

The above runtimes can be plotted directly.