The goal of this document is to present the results of benchmarking Quantum Fourier Transform (QFT) circuit simulations. The main focus is the relationship between the number of input qubits and the time required to complete the corresponding simulation. The influence of other parameters, such as the number of computing nodes used, is also examined.
The QFT simulations were set up using two different frameworks – QuEST and Qiskit. QuEST (Quantum Exact Simulation Toolkit) is a high-performance simulator written in C and C++. It is well suited to HPC, as it supports multithreading (OpenMP) and distributed computation via message passing (MPI). Qiskit, on the other hand, is a Python library with core functions implemented in C++. It supports multithreading, and it can also be made to run as parallel MPI tasks. This is done via cache blocking[^1].
The simulations were run on the ARCHER2 computing cluster. The number of threads was fixed throughout the simulations to the number of cores per node – 128. This way, each node was fully utilised[^2] without the need to pass any local messages, which would negatively impact performance. The number of nodes, on the other hand, was varied over powers of 2 (from 1 to 256)[^3] to examine the scaling speedup. In the Qiskit implementation, the number of blocking qubits had to follow the formula \(N_{block}=N_{qub}-\log_2{N_{proc}}\) and, due to memory limitations, could not be larger than 26, as otherwise the maximum size of the passed messages would exceed 32-bit integers[^4].
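As an illustration, below is a minimal sketch of how this could be configured through Qiskit Aer's `AerSimulator` (the qubit and task counts are hypothetical; `blocking_enable` and `blocking_qubits` are the Aer options for cache blocking, but this is not necessarily the exact code used for the benchmarks):

```python
from math import log2
from qiskit_aer import AerSimulator

n_qubits, n_tasks = 32, 16               # illustrative problem size and MPI task count
n_block = int(n_qubits - log2(n_tasks))  # N_block = N_qub - log2(N_proc)
n_block = min(n_block, 26)               # cap implied by the message-size limit above

# Statevector simulation with cache blocking enabled for MPI runs
sim = AerSimulator(method="statevector",
                   blocking_enable=True,
                   blocking_qubits=n_block)
```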
The circuit corresponding to the QFT for \(n\) qubits looks as follows:
It is a non-trivial circuit, which makes it particularly suitable for benchmarking – it consists of \(n\) Hadamard gates, \(\frac{1}{2}n(n-1)\) \(UROT_k\) gates, and \(\lfloor n/2 \rfloor\) swap gates. \(UROT_k\) represents a controlled phase shift by \(\frac{2\pi}{2^k}\). The overall action of the circuit on a basis state \(\lvert x \rangle\) can be written as: \[ QFT \lvert x \rangle = \frac{1}{\sqrt{2^n}}\bigotimes_{k=1}^n \left( \lvert 0 \rangle + \exp\left(\frac{2\pi ix}{2^k}\right) \lvert 1 \rangle \right) \] If we take \(x = 0\), the action becomes trivial: \[ QFT \lvert 0 \rangle = \frac{1}{\sqrt{2^n}}\bigotimes_{k=1}^n \left( \lvert 0 \rangle + \lvert 1 \rangle \right) = \frac{1}{\sqrt{2^n}}\sum_{i=0}^{2^n-1} \lvert i \rangle \] That is, every basis state has equal weight. Such a result is very easy to verify, validating that the implemented circuit works correctly.
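As a concrete illustration, here is a minimal Qiskit sketch of this textbook construction (not necessarily the exact code used for the benchmarks):

```python
from math import pi
from qiskit import QuantumCircuit

def qft_circuit(n: int) -> QuantumCircuit:
    """Textbook QFT: n Hadamards, n(n-1)/2 controlled rotations, floor(n/2) swaps."""
    qc = QuantumCircuit(n)
    for target in range(n):
        qc.h(target)  # Hadamard on the current qubit
        # UROT_k: phase shift by 2*pi/2^k, controlled on a later qubit
        for k, control in enumerate(range(target + 1, n), start=2):
            qc.cp(2 * pi / 2 ** k, control, target)
    # Final swaps reverse the qubit order
    for i in range(n // 2):
        qc.swap(i, n - 1 - i)
    return qc
```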
The results were written by a SLURM script to an output file in CSV format with three columns – Nqubits, Ntasks and Time (in ms). In order to save on credits, most jobs were submitted for a fixed number of nodes, as otherwise some of the reserved resources would have stayed idle. The outputs of each series of computations were combined by hand[^5].
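For reference, a small sketch (with hypothetical file names) of how such CSV outputs can be combined and pivoted into the tables below, e.g. with pandas:

```python
import pandas as pd

# Hypothetical file names; each file holds the columns Nqubits, Ntasks, Time (ms)
files = ["qft_quest_run1.csv", "qft_quest_run2.csv"]
raw = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Convert milliseconds to seconds and pivot into an Nqubits x Ntasks table
table = (raw.assign(Time=raw["Time"] / 1000)
            .pivot(index="Nqubits", columns="Ntasks", values="Time"))
```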
The final raw data, transformed into a more readable form, are presented below. The \(N=2^k\) columns represent the number of nodes, while the floating-point values represent timings in seconds. Some timings are not available due to memory limits; these are marked as NA.
QuEST timings:
## # A tibble: 13 × 10
## Nqubits `N=1` `N=2` `N=4` `N=8` `N=16` `N=32` `N=64` `N=128` `N=256`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 28 6.46 4.42 2.59 0.941 0.587 0.333 0.215 0.133 0.131
## 2 29 13.4 9.05 5.52 3.04 1.14 0.662 0.326 0.210 0.104
## 3 30 28.0 18.7 11.2 6.46 3.69 1.33 0.782 0.381 0.254
## 4 31 58.6 38.7 22.8 13.3 7.70 4.19 1.52 0.896 0.468
## 5 32 125. 80.9 47.4 27.1 15.6 8.81 4.70 1.75 1.02
## 6 33 264. 169. 97.7 55.7 30.9 17.5 9.71 5.18 2.04
## 7 34 NA NA 201. 114. 63.5 35.4 19.6 10.9 5.67
## 8 35 NA NA NA 235. 129. 72.0 39.4 21.9 12.1
## 9 36 NA NA NA NA 262. 147. 79.3 44.2 24.4
## 10 37 NA NA NA NA NA 299. 160. 88.9 48.3
## 11 38 NA NA NA NA NA NA 322. 179. 96.0
## 12 39 NA NA NA NA NA NA NA 358. 193.
## 13 40 NA NA NA NA NA NA NA NA 387.
Qiskit timings:
## # A tibble: 11 × 8
## Nqubits `N=1` `N=2` `N=4` `N=8` `N=16` `N=32` `N=64`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 28 11.2 9.52 4.85 2.76 1.71 1.32 1.21
## 2 29 23.5 20.0 9.89 5.14 2.98 1.75 1.47
## 3 30 49.8 43.6 21.6 11.1 5.83 3.21 2.31
## 4 31 105. 92.2 45.4 23.2 12.1 6.27 4.18
## 5 32 220. 194. 96.2 50.1 25.8 13.3 7.89
## 6 33 481. 391. 190. 95.9 48.7 25.3 14.5
## 7 34 NA NA 412. 208. 107. 54.7 30.4
## 8 35 NA NA NA 442. 222. 112. 61.4
## 9 36 NA NA NA NA 476. 242. 130.
## 10 37 NA NA NA NA NA 483. 255.
## 11 38 NA NA NA NA NA NA 552.
The above runtimes can be plotted directly.
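For instance, using the `table` produced in the earlier sketch, a log-scale plot makes the exponential trend visible:

```python
import matplotlib.pyplot as plt

# One line per node count; a log-scale y-axis exposes the exponential growth
for n_nodes in table.columns:
    plt.semilogy(table.index, table[n_nodes], marker="o", label=f"N={n_nodes}")
plt.xlabel("Number of qubits")
plt.ylabel("Runtime (s)")
plt.legend()
plt.show()
```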
There is an exponential dependence between the runtime and the number of qubits; the curve gets shifted by \(k\) qubits when run on \(2^k\) nodes. With additional transformations, it can be seen how adding more nodes speeds the computation up for a fixed number of qubits (denoted by \(Q\) below). Ideally, doubling the number of nodes should also double the speed. In reality this is not the case, as not everything can be perfectly parallelised and tasks need to exchange messages. This results in lower speedups that scale non-linearly.
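This behaviour is commonly summarised by Amdahl's law: if only a fraction \(p\) of the work can be parallelised perfectly, the strong-scaling speedup on \(N\) nodes is bounded by \[ S(N) = \frac{T(1)}{T(N)} \le \frac{1}{(1-p) + p/N}, \] which saturates at \(1/(1-p)\) instead of growing linearly with \(N\).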
The above plots show strong scaling, as only the number of nodes is increased while the problem size stays the same. It is apparent that doubling the number of tasks increases the speed by a factor smaller than 2. This factor varies, sometimes increasing with more tasks, and seems to hit a plateau for \(Q=28\). To give more insight into how the factor changes, the strong-scaling speedup relative to the previous computation (i.e. the run with half as many tasks) can be plotted.
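In other words, for each node count \(N\) the plotted quantity is the ratio \[ r(N) = \frac{T(N/2)}{T(N)}, \] which equals 2 under ideal strong scaling.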
Interestingly, each of the above relations shows a clear peak in the middle, with a relative speedup larger than 2. A plausible explanation for such superlinear behaviour is a cache effect: once the per-node chunk of the state vector becomes small enough to fit into faster levels of the memory hierarchy, doubling the node count can more than halve the runtime.
Finally, weak scaling can be investigated. It is examined by scaling both the computing power and the problem size together. There are two ways of comparing the results, both presented below. The first is efficiency – how much computing power each node appears to contribute on average, compared to a single-node run. This is expected to decrease with the number of nodes, due to the parts of the problem that cannot be executed fully in parallel. The second is scaled speedup, which shows what the speedup would be if a fixed-size problem were executed on nodes with the given efficiency.
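Since adding one qubit doubles the size of the state vector, the per-node problem size stays constant along the diagonal \(Q = Q_0 + k\), \(N = 2^k\). One consistent way to define the two quantities along this diagonal is \[ E(2^k) = \frac{T(Q_0,\,1)}{T(Q_0+k,\,2^k)}, \qquad S_{scaled}(2^k) = 2^k\,E(2^k), \] where \(T(Q, N)\) denotes the runtime for \(Q\) qubits on \(N\) nodes.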
The efficiency drops quickly as nodes are added, soon reaching values below 40%. With more qubits, it appears to drop slightly more slowly. Similarly, the scaled speedup grows much more slowly than the ideal, with e.g. 256 nodes effectively acting like 70–80 nodes.
[^1]: J. Doi, H. Horii – Cache Blocking Technique to Large Scale Quantum Computing Simulation on Supercomputers.

[^2]: The support for simultaneous multithreading (SMT) was turned off.

[^3]: This choice has to do with how the workload is divided between tasks – each node holds a chunk of the state vector. As a result, setting the number of nodes to a power of 2 allows the nodes to work with equal-sized chunks.

[^4]: Actually, this was 31-bit integers, as the message size was characterised by a signed value. In addition, the size referred to the number of bytes in the message rather than the number of complex values, which further limited the maximum number of blocking qubits.

[^5]: It is possible to append the results of successive SLURM jobs via the `#SBATCH --open-mode=append` option; however, for some reason it locked out the directory that contained the output files.