Introduction

As we enter the era in which quantum computing is becoming viable, it is especially important to focus on classical simulations of quantum systems. They can provide useful insight into the mechanisms behind quantum phenomena, and can be used to verify the results of the new technology. In addition, current quantum devices are often called Noisy Intermediate-Scale Quantum (NISQ) devices, due to the high levels of noise that affect them. As a result, it may still be possible to keep classical simulations competitive, incentivising the development of both paradigms.

In general, simulating quantum circuits is an exponentially hard task – both in terms of runtime and memory. This is because a quantum state of \(n\) qubits is equivalent to some superposition of all possible \(n\)-bit strings, each weighted by a complex amplitude (whose modulus squared equals the probability of measuring the corresponding string). Therefore, \(2^n\) complex numbers are needed to describe the state. However, many states can be separated into products of smaller states. If a state is separable into single-qubit states, it is non-entangled. Otherwise, there is some degree of quantum entanglement present, which describes correlations between the measurements of individual qubits in the given system. This is what differentiates quantum computing from its classical counterpart, since classically those measurements would be independent. There are methods for compressing states with little entanglement, which are described below. Notwithstanding that, classical simulations of quantum systems remain very energy inefficient.

Two different circuits were selected for benchmarking – Quantum Fourier Transform (QFT) and Random Quantum Circuit (RAND). Likewise, two platforms were used to create and run the simulations, each representing a distinct approach. The Quantum Exact Simulation Toolkit (QuEST) was used for full state vector simulations, while ITensor was leveraged to perform circuit simulations by contracting tensor networks of matrix product states (MPS) and matrix product operators (MPO).

Each approach allows different modifications to save energy. QuEST requires multiprocessing for larger simulations due to memory limitations, which makes it very heavy on inter-node communication. Therefore, it was found to be beneficial to slightly reduce the CPU clock frequency, so that less energy is wasted while waiting for messages to arrive. On the other hand, ITensor is currently completely serial and single-threaded, but features a representation that makes compression easy for states with limited entanglement. Hence, it was found to be better for some algorithms and input states, despite its limitations.

Methodology

The circuits and frameworks used in this work are presented below.

Quantum Fourier Transform

QFT is a very common component of other quantum algorithms. It performs a quantum analogue of the Discrete Fourier Transform on the input state, described by the following formula: \[ QFT_N \lvert x \rangle = \frac{1}{\sqrt{N}}\sum_{y=0}^{N-1} \omega_N^{xy} \lvert y \rangle \]

where \(N=2^n\) is the number of basis states, and \(\omega_N=e^{\frac{2\pi i}{N}}\). This can be re-expressed as a tensor product of single-qubit states: \[ QFT_N \lvert x \rangle = \frac{1}{\sqrt{N}}\bigotimes_{k=1}^n \left( \lvert 0 \rangle + \exp\left(\frac{2\pi ix}{2^k}\right) \lvert 1 \rangle \right) \]

Therefore, if the input state is a basis state \(\lvert x \rangle\), its Quantum Fourier Transform will not introduce any new entanglement. However, if the input is a superposition of basis states \(\lvert \psi \rangle = \sum_{x=0}^{N-1} \psi_x \lvert x \rangle\), the corresponding transform will be a superposition of the transforms of each basis state. This way, some amplitudes may cancel each other out and alter the entanglement of the result. We say that the QFT is weakly entangling: its output can have a different level of entanglement than its input, but it is unlikely to produce an arbitrarily highly entangled state.

The circuit is easy to verify, as inputting the \(\lvert 0 \rangle\) state should output an equal superposition of all basis states. It consists of \(\mathcal{O}(n)\) non-diagonal Hadamard gates and \(\mathcal{O}(n^2)\) diagonal controlled \(\theta\) phase shift gates. [1]
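This verification is easy to sketch numerically. The snippet below (a minimal NumPy sketch, not the benchmarked code) builds the dense \(QFT_N\) matrix from the formula above and checks that \(QFT_N \lvert 0...0 \rangle\) is the equal superposition:

```python
import numpy as np

def qft_matrix(n):
    """Dense QFT matrix on n qubits: entry (y, x) is omega_N^{xy} / sqrt(N)."""
    N = 2 ** n
    omega = np.exp(2j * np.pi / N)
    x, y = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return omega ** (x * y) / np.sqrt(N)

n = 4
state = np.zeros(2 ** n, dtype=complex)
state[0] = 1.0                      # the |00...0> basis state
out = qft_matrix(n) @ state

# QFT|00...0> should be the equal superposition of all 2^n basis states
assert np.allclose(out, np.full(2 ** n, 1 / np.sqrt(2 ** n)))
```

Of course, the dense matrix itself costs \(\mathcal{O}(4^n)\) memory, so this check is only feasible for small \(n\); the actual circuit applies the \(\mathcal{O}(n^2)\) gates directly.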

Random Circuit

RAND is a simplified version of the circuit used in Google’s quantum advantage claim. It is built of multiple entangling layers. In each layer, every qubit is first acted on by a random gate from the set \(\{\sqrt{X}, \sqrt{Y}, \sqrt{W}\}\); neighbouring qubits are then entangled with controlled \(\frac{\pi}{2}\) phase shift gates. At least \(n\) layers are necessary to significantly entangle \(n\) qubits. The circuit has \(\mathcal{O}(n^2)\) random non-diagonal single-qubit gates and, likewise, \(\mathcal{O}(n^2)\) diagonal two-qubit \(CP\) gates. [2]
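A minimal dense sketch of one such layer is given below. The exact \(\sqrt{X}\), \(\sqrt{Y}\), \(\sqrt{W}\) matrices are assumptions taken from the usual conventions (with \(W=(X+Y)/\sqrt{2}\)), since the text only names the gate set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Single-qubit gate set; these particular matrices are an assumption,
# following common conventions for sqrt(X), sqrt(Y) and sqrt(W).
SX = 0.5 * np.array([[1 + 1j, 1 - 1j], [1 - 1j, 1 + 1j]])
SY = 0.5 * np.array([[1 + 1j, -1 - 1j], [1 + 1j, 1 + 1j]])
SW = np.array([[1, -np.exp(1j * np.pi / 4)],
               [np.exp(-1j * np.pi / 4), 1]]) / np.sqrt(2)
GATES = [SX, SY, SW]

def cp_diagonal(n, ctrl, targ, theta):
    """Diagonal of a controlled phase gate acting on qubits (ctrl, targ)."""
    idx = np.arange(2 ** n)
    both = ((idx >> ctrl) & 1) & ((idx >> targ) & 1)
    return np.where(both, np.exp(1j * theta), 1.0)

def random_layer(state, n):
    """One layer: a random 1-qubit gate per qubit, then CP(pi/2) on neighbours."""
    U = GATES[rng.integers(3)]
    for _ in range(1, n):
        U = np.kron(GATES[rng.integers(3)], U)
    state = U @ state
    for q in range(n - 1):
        state = cp_diagonal(n, q, q + 1, np.pi / 2) * state
    return state

n = 4
state = np.zeros(2 ** n, dtype=complex)
state[0] = 1.0
for _ in range(n):                   # at least n layers to entangle n qubits
    state = random_layer(state, n)
assert np.isclose(np.linalg.norm(state), 1.0)   # unitarity preserved
```

Building the full layer unitary with `np.kron` is only viable for toy sizes; both QuEST and ITensor instead apply each gate locally to the state representation.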

Simulations

There are two main kinds of methods for solving quantum circuits. The first is based on the full state vector representation, and requires \(\mathcal{O}(2^n)\) memory for \(n\) qubits. It is possible to split the state into multiple parts and connect them via Feynman paths, but this increases time complexity by a factor of 4 for each connection between parts. As a result, this algorithm is very expensive either in terms of memory requirements or execution time. [3]

An alternative is to treat the circuit as a network of tensors. This approach is much more flexible, as the contractions between tensors can be performed in any order. In addition, such a structure is perfect for the immediate computation of required amplitudes, by appending a given output state to the network. This is a full contraction, which yields a scalar, and under an optimal contraction order it can be much faster than computing the whole statevector. However, most of the information is lost that way, and if too many amplitudes are needed, the runtime rapidly increases. This problem has been addressed in the literature, and many methods for computing batches of amplitudes have been developed. [3]
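To illustrate amplitude computation by full contraction, the toy example below appends an output state to a tiny two-gate network (a Hadamard followed by a CNOT, producing a Bell state) and contracts everything to a scalar with `numpy.einsum`; the circuit itself is purely an illustrative choice:

```python
import numpy as np

# Gate tensors: H as a (2, 2) tensor, CNOT as a (2, 2, 2, 2) tensor
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CNOT = np.zeros((2, 2, 2, 2))
for c in range(2):
    for t in range(2):
        CNOT[c, (c + t) % 2, c, t] = 1.0   # [ctrl_out, targ_out, ctrl_in, targ_in]

ket0 = np.array([1.0, 0.0])

def amplitude(out_bits):
    """<out|CNOT (H x I)|00>: contract the whole network to a scalar."""
    bra = [np.eye(2)[b] for b in out_bits]   # appended output state
    # a, b: circuit inputs; i: wire between H and CNOT; c, d: outputs
    return np.einsum("a,b,ia,cdib,c,d->",
                     ket0, ket0, H, CNOT, bra[0], bra[1])
```

Here `amplitude((0, 0))` and `amplitude((1, 1))` each contract to \(1/\sqrt{2}\), while the cross terms vanish, as expected for a Bell state; only the requested amplitude is ever computed, never the full statevector.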

All of the above methods are exponential in some parameter. In principle, this is what makes quantum computers better than their classical counterparts. However, if we factor in noise, we can develop polynomial algorithms. For example, a statevector can be represented by a matrix product state (MPS), which stores the entanglement between qubits in singular values. If we limit the number of singular values, and truncate them as the circuit evolves, we emulate a kind of decoherence noise. This limits the size of the state representation and the number of computations to polynomial order. [4,5]
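The truncation step can be sketched with a single SVD across one bond, which is a toy version of what an MPS does at every bond; the GHZ test state is just an illustrative choice:

```python
import numpy as np

def truncated_split(state, n_left, chi):
    """Split an n-qubit statevector across one bond and keep only the
    chi largest singular values (the MPS-style truncation)."""
    M = state.reshape(2 ** n_left, -1)
    U, s, Vh = np.linalg.svd(M, full_matrices=False)
    keep = min(chi, len(s))
    approx = (U[:, :keep] * s[:keep]) @ Vh[:keep, :]
    err = np.sqrt(np.sum(s[keep:] ** 2))     # discarded weight
    return approx.reshape(state.shape), err

# A GHZ state has exactly 2 nonzero singular values across any cut,
# so chi = 2 reproduces it exactly while chi = 1 loses half the weight.
n = 6
ghz = np.zeros(2 ** n)
ghz[0] = ghz[-1] = 1 / np.sqrt(2)
exact, err2 = truncated_split(ghz, n // 2, chi=2)
assert np.allclose(exact, ghz) and np.isclose(err2, 0.0)
_, err1 = truncated_split(ghz, n // 2, chi=1)
assert np.isclose(err1, 1 / np.sqrt(2))
```

Capping `chi` at every bond is exactly what bounds the MPS to polynomial size; the accumulated discarded weight plays the role of the emulated decoherence noise.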

Results

The results were written by a SLURM script to an output file in CSV format with three columns: Nqubits, Ntasks and Time (in ms). In order to save on credits, most jobs were submitted for a fixed number of nodes, as otherwise some of the reserved resources would stay idle. The outputs of each series of computations were combined by hand.

The raw data, transformed into a more readable form, are presented below. The \(N=2^k\) columns give the number of nodes, while the floating point values are timings in seconds. Some timings are not available due to memory limits; these are marked NA.

QuEST timings:

Frequency Circuit Nqubits N=1 N=2 N=4 N=8 N=16 N=32 N=64 N=128
Low QFT 32 137.8 89.2 50.4 30.1 17.3 9.8 5.4 2.4
Medium QFT 32 129.5 84.1 47.4 27.7 15.9 9.2 5.0 2.1
Highm1 QFT 32 129.4 84.4 47.4 28.4 16.3 9.0 5.0 2.1
High QFT 32 124.9 80.5 45.5 27.1 15.2 8.8 4.7 1.8
Low QFT 33 281.7 189.7 103.9 60.9 34.7 19.4 10.8 6.0
Medium QFT 33 272.5 179.7 98.5 57.6 32.9 18.6 10.2 5.5
Highm1 QFT 33 271.7 178.3 95.1 57.7 32.1 18.6 10.2 5.5
High QFT 33 251.2 169.1 93.3 53.7 31.4 17.3 9.7 5.1
Low QFT 34 NA NA 216.1 125.4 70.6 39.8 21.8 12.2
Medium QFT 34 NA NA 204.7 119.0 64.9 37.6 20.7 11.4
Highm1 QFT 34 NA NA 203.6 118.1 65.4 37.5 20.6 11.4
High QFT 34 NA NA 192.7 110.7 63.3 35.8 19.6 10.8
Low QFT 35 NA NA NA 265.2 141.3 80.1 44.4 24.4
Medium QFT 35 NA NA NA 243.5 135.1 75.7 42.0 23.0
Highm1 QFT 35 NA NA NA 244.1 134.8 75.7 42.1 23.3
High QFT 35 NA NA NA 230.8 128.3 72.4 39.9 21.8
Low QFT 36 NA NA NA NA 295.2 164.1 90.6 49.5
Medium QFT 36 NA NA NA NA 278.3 154.5 85.7 46.5
Highm1 QFT 36 NA NA NA NA 279.0 154.6 85.2 46.4
High QFT 36 NA NA NA NA 263.1 146.6 81.4 44.2
Low RAND 32 768.8 581.4 397.6 259.0 156.7 93.0 52.7 28.4
Medium RAND 32 743.0 551.7 374.9 242.0 146.9 86.7 48.9 25.9
Highm1 RAND 32 741.3 555.5 373.9 243.7 147.2 93.8 48.8 25.8
High RAND 32 712.9 528.0 354.9 229.1 139.1 81.9 46.0 24.1
Low RAND 33 1626.4 1224.5 814.8 526.4 324.1 191.8 110.7 61.7
Medium RAND 33 1572.8 1163.3 768.0 498.4 304.0 179.1 103.4 57.6
Highm1 RAND 33 1578.9 1165.3 765.2 495.6 305.2 179.4 103.4 57.5
High RAND 33 1523.7 1101.9 724.6 470.5 287.0 169.5 104.1 53.9
Low RAND 34 NA NA 1665.3 1079.5 653.3 396.2 227.5 129.4
Medium RAND 34 NA NA 1564.2 1011.3 613.2 370.5 231.2 120.9
Highm1 RAND 34 NA NA 1559.6 1007.1 612.2 370.5 213.0 121.1
High RAND 34 NA NA 1476.2 971.9 577.6 349.3 201.0 113.9
Low RAND 35 NA NA NA 2175.8 1322.8 803.5 473.6 277.2
Medium RAND 35 NA NA NA 2033.4 1231.9 750.7 443.9 249.5
Highm1 RAND 35 NA NA NA 2043.6 1237.9 750.5 443.9 248.6
High RAND 35 NA NA NA 1943.4 1158.4 708.0 417.9 234.0
Low RAND 36 NA NA NA NA 2692.7 1619.6 979.4 552.7
Medium RAND 36 NA NA NA NA 2511.6 1505.8 916.2 516.4
Highm1 RAND 36 NA NA NA NA 2505.7 1510.4 917.4 515.8
High RAND 36 NA NA NA NA 2349.9 1414.4 862.1 486.1

ITensor timings:

Circuit Nqubits InitState MaxDim Runtime Energy Experiment
BENCH 12 |0..0> 256 0.805679 622 RAND (trimmed)
BENCH 12 |0..0> 16777216 0.802858 694 RAND (full)
QFT 12 |0..0> 16777216 0.321774 315 QFT (0 input)
QFT 12 |Wn> 16777216 0.252076 276 QFT (W input)
BENCH 14 |0..0> 256 1.377220 870 RAND (trimmed)
BENCH 14 |0..0> 16777216 1.378290 886 RAND (full)
QFT 14 |0..0> 16777216 0.398844 358 QFT (0 input)
QFT 14 |Wn> 16777216 0.342208 345 QFT (W input)
BENCH 16 |0..0> 256 2.885560 1320 RAND (trimmed)
BENCH 16 |0..0> 16777216 2.807600 1150 RAND (full)
QFT 16 |0..0> 16777216 0.482060 326 QFT (0 input)
QFT 16 |Wn> 16777216 0.475178 294 QFT (W input)
BENCH 18 |0..0> 256 7.969530 2820 RAND (trimmed)
BENCH 18 |0..0> 16777216 10.585000 3360 RAND (full)
QFT 18 |0..0> 16777216 0.595939 345 QFT (0 input)
QFT 18 |Wn> 16777216 0.610556 360 QFT (W input)
BENCH 20 |0..0> 256 17.117900 4740 RAND (trimmed)
BENCH 20 |0..0> 16777216 53.225500 14380 RAND (full)
QFT 20 |0..0> 16777216 0.682830 364 QFT (0 input)
QFT 20 |Wn> 16777216 0.811472 346 QFT (W input)
BENCH 22 |0..0> 256 30.693400 8260 RAND (trimmed)
BENCH 22 |0..0> 16777216 331.654000 81070 RAND (full)
QFT 22 |0..0> 16777216 0.869382 370 QFT (0 input)
QFT 22 |Wn> 16777216 1.081210 549 QFT (W input)
BENCH 24 |0..0> 256 47.553300 12580 RAND (trimmed)
BENCH 24 |0..0> 16777216 1563.180000 379340 RAND (full)
QFT 24 |0..0> 16777216 1.131030 480 QFT (0 input)
QFT 24 |Wn> 16777216 1.358740 548 QFT (W input)
BENCH 26 |0..0> 256 68.717700 17140 RAND (trimmed)
QFT 26 |0..0> 16777216 1.384400 530 QFT (0 input)
QFT 26 |Wn> 16777216 1.713760 621 QFT (W input)
BENCH 28 |0..0> 256 94.232300 25920 RAND (trimmed)
QFT 28 |0..0> 16777216 1.670910 607 QFT (0 input)
QFT 28 |Wn> 16777216 2.097140 670 QFT (W input)
BENCH 30 |0..0> 256 123.518000 31140 RAND (trimmed)
QFT 30 |0..0> 16777216 1.966470 649 QFT (0 input)
QFT 30 |Wn> 16777216 2.447190 851 QFT (W input)
BENCH 32 |0..0> 256 157.958000 40530 RAND (trimmed)
QFT 32 |0..0> 16777216 2.297540 724 QFT (0 input)
QFT 32 |Wn> 16777216 2.878970 843 QFT (W input)
BENCH 34 |0..0> 256 193.612000 57920 RAND (trimmed)
QFT 34 |0..0> 16777216 2.702240 878 QFT (0 input)
QFT 34 |Wn> 16777216 3.955050 1240 QFT (W input)
BENCH 36 |0..0> 256 236.021000 72610 RAND (trimmed)
QFT 36 |0..0> 16777216 3.230490 1120 QFT (0 input)
QFT 36 |Wn> 16777216 5.529280 1610 QFT (W input)
BENCH 38 |0..0> 256 279.868000 80810 RAND (trimmed)
QFT 38 |0..0> 16777216 3.677020 1060 QFT (0 input)
QFT 38 |Wn> 16777216 7.410140 2080 QFT (W input)
BENCH 40 |0..0> 256 332.627000 93580 RAND (trimmed)
QFT 40 |0..0> 16777216 4.413010 1280 QFT (0 input)
QFT 40 |Wn> 16777216 10.517300 2540 QFT (W input)
BENCH 42 |0..0> 256 402.321000 124050 RAND (trimmed)
QFT 42 |0..0> 16777216 4.727110 1350 QFT (0 input)
QFT 42 |Wn> 16777216 13.946100 3470 QFT (W input)
BENCH 44 |0..0> 256 443.380000 126170 RAND (trimmed)
QFT 44 |0..0> 16777216 5.732970 1670 QFT (0 input)
QFT 44 |Wn> 16777216 18.032500 4530 QFT (W input)
BENCH 46 |0..0> 256 502.356000 149490 RAND (trimmed)
QFT 46 |0..0> 16777216 6.363610 1700 QFT (0 input)
QFT 46 |Wn> 16777216 23.087000 5530 QFT (W input)
BENCH 48 |0..0> 256 573.018000 172720 RAND (trimmed)
QFT 48 |0..0> 16777216 7.198300 2090 QFT (0 input)
QFT 48 |Wn> 16777216 28.961300 7470 QFT (W input)
BENCH 50 |0..0> 256 650.325000 205200 RAND (trimmed)
QFT 50 |0..0> 16777216 7.752420 1910 QFT (0 input)
QFT 50 |Wn> 16777216 36.582000 8740 QFT (W input)
BENCH 52 |0..0> 256 713.854000 220620 RAND (trimmed)
QFT 52 |0..0> 16777216 9.426870 2410 QFT (0 input)
QFT 52 |Wn> 16777216 45.634500 11640 QFT (W input)
BENCH 54 |0..0> 256 794.648000 269100 RAND (trimmed)
QFT 54 |0..0> 16777216 10.160100 2800 QFT (0 input)
QFT 54 |Wn> 16777216 56.369100 13620 QFT (W input)
BENCH 56 |0..0> 256 886.571000 291100 RAND (trimmed)
QFT 56 |0..0> 16777216 11.736600 2720 QFT (0 input)
QFT 56 |Wn> 16777216 69.114500 17390 QFT (W input)
BENCH 58 |0..0> 256 962.006000 303530 RAND (trimmed)
QFT 58 |0..0> 16777216 12.803100 3350 QFT (0 input)
QFT 58 |Wn> 16777216 89.085500 21630 QFT (W input)
BENCH 60 |0..0> 256 1059.990000 382820 RAND (trimmed)
QFT 60 |0..0> 16777216 13.835500 3500 QFT (0 input)
QFT 60 |Wn> 16777216 106.931000 24800 QFT (W input)
BENCH 62 |0..0> 256 1154.630000 393440 RAND (trimmed)
QFT 62 |0..0> 16777216 15.082800 3800 QFT (0 input)
QFT 62 |Wn> 16777216 136.141000 33720 QFT (W input)
BENCH 64 |0..0> 256 1274.880000 447260 RAND (trimmed)
QFT 64 |0..0> 16777216 17.094300 4420 QFT (0 input)
QFT 64 |Wn> 16777216 165.921000 40360 QFT (W input)
BENCH 66 |0..0> 256 1358.580000 454060 RAND (trimmed)
QFT 66 |0..0> 16777216 18.182900 4430 QFT (0 input)
QFT 66 |Wn> 16777216 184.707000 44620 QFT (W input)
BENCH 68 |0..0> 256 1474.950000 511090 RAND (trimmed)
QFT 68 |0..0> 16777216 20.357600 5090 QFT (0 input)
QFT 68 |Wn> 16777216 231.379000 56220 QFT (W input)
BENCH 70 |0..0> 256 1579.340000 506360 RAND (trimmed)
QFT 70 |0..0> 16777216 20.230600 4930 QFT (0 input)
QFT 70 |Wn> 16777216 288.459000 69720 QFT (W input)
QFT 72 |0..0> 16777216 24.923100 6400 QFT (0 input)
QFT 72 |Wn> 16777216 346.768000 86960 QFT (W input)
QFT 74 |0..0> 16777216 26.342600 6520 QFT (0 input)
QFT 74 |Wn> 16777216 456.392000 109730 QFT (W input)
QFT 76 |0..0> 16777216 27.884200 6910 QFT (0 input)
QFT 76 |Wn> 16777216 596.460000 150470 QFT (W input)
QFT 78 |0..0> 16777216 30.179600 7650 QFT (0 input)
QFT 78 |Wn> 16777216 627.008000 151430 QFT (W input)
QFT 80 |0..0> 16777216 32.779900 8300 QFT (0 input)
QFT 80 |Wn> 16777216 817.609000 198300 QFT (W input)
QFT 82 |0..0> 16777216 35.465700 8920 QFT (0 input)
QFT 82 |Wn> 16777216 862.174000 194590 QFT (W input)
QFT 84 |0..0> 16777216 37.657800 9220 QFT (0 input)
QFT 84 |Wn> 16777216 952.616000 230260 QFT (W input)
QFT 86 |0..0> 16777216 41.559900 10110 QFT (0 input)
QFT 86 |Wn> 16777216 1050.530000 250290 QFT (W input)
QFT 88 |0..0> 16777216 40.099200 10470 QFT (0 input)
QFT 88 |Wn> 16777216 1197.190000 287880 QFT (W input)
QFT 90 |0..0> 16777216 46.652300 11610 QFT (0 input)
QFT 90 |Wn> 16777216 1371.380000 324020 QFT (W input)

Weak scaling

Weak scaling was measured on the minimum number of ARCHER2 nodes required to fit the problem, starting from 1 node for 32 qubits and doubling for every additional qubit. Since parallel efficiency is poor, this is clearly the most efficient approach. An exception is that 33 qubits can still fit on 1 node (no MPI send/receive buffer is needed). QFT performed better than RAND due to its prevalence of diagonal gates. As predicted, there is a significant energy consumption gap between the high and medium CPU clock frequencies.

Clock Frequency Impact

Clock frequency impact was measured for the same scaling as in the weak scaling figure. Runtime and energy are displayed relative to the high frequency. A new highm1 frequency was added, which lies between medium and high.

It is clear that the runtime penalty of decreasing the frequency is smaller than the gain in energy efficiency, with the medium setting being optimal. Therefore, it is a viable strategy for greener simulations, adding only a few seconds of runtime.

ITensor Scaling

Tensor networks allow an efficient state vector representation via a Matrix Product State (MPS). Instead of storing \(2^n\) amplitudes, each qubit is represented by a site, connected to its neighbours by contractible bonds whose dimensions correspond to the entanglement of the state:

  • low-entanglement states are compact and easy to evolve/contract
  • for highly entangled states the bond dimension explodes, but it can be truncated, which approximates the state

The platform used for the experiments is ITensor. It does not support any parallelism for general tensor networks, but even under those limitations it can outperform QuEST for some circuits.

QFT is a circuit that does not induce much entanglement, so it can be simulated very quickly. It was fed with the trivial \(\vert 00...0 \rangle\) state, and with a lightly entangled \(\vert W \rangle\) state defined as:

\[\vert W \rangle = \frac{1}{\sqrt n} (\vert 100...0 \rangle + \vert 010...0 \rangle + ... + \vert 000...1 \rangle)\]
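For illustration, the \(\vert W \rangle\) state can be constructed directly, and an SVD across a cut confirms that it is only lightly entangled (MPS bond dimension 2). This check is a NumPy sketch, not part of the benchmark:

```python
import numpy as np

def w_state(n):
    """|W> on n qubits: equal superposition of all single-excitation strings."""
    psi = np.zeros(2 ** n)
    for q in range(n):
        psi[1 << q] = 1 / np.sqrt(n)
    return psi

psi = w_state(6)
# Across a bipartition into two halves, the W state has only two nonzero
# singular values, so its MPS bond dimension is 2 -- "lightly entangled".
s = np.linalg.svd(psi.reshape(8, 8), compute_uv=False)
assert np.sum(s > 1e-12) == 2
```

This is why the QFT on the \(\vert W \rangle\) input stays cheap in the MPS picture, in contrast to the highly entangled states produced by RAND.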

RAND is highly entangling, which makes simulating the full state difficult. Promising results were achieved by trimming the bond dimension: in the figure above, a maximum bond dimension of 256 was used for the trimmed state. However, its accuracy quickly degrades as the number of qubits grows.

Evaluation

So far, we have only tested the Schrödinger and MPS algorithms thoroughly. We used QuEST for state vector simulation, as on an HPC system it proved to be much faster than Qiskit. It offers parallelisation across \(2^p\) processes, and ARCHER2 was able to simulate up to 43 qubits on 2,048 nodes. Interestingly, downclocking the CPU can save considerable energy with little time penalty. We are also writing our own simulator to remove the \(2^p\)-process restriction, which we found leaves some nodes underused. [6]

Tensor networks were tested in ITensor, which does not offer message-passing parallelism, but even without it can be much faster than the Schrödinger method. As long as the entanglement in the system stays low and keeps being truncated (which acts similarly to noise), the runtime shows polynomial complexity. However, if we want a noiseless statevector, it performs much worse than the Schrödinger method. [7]

Conclusion

The state vector approach is stable regardless of the entanglement. Due to its high communication load, it is most economical when run on the minimum number of nodes that fits the problem, with the CPU clock frequency reduced to medium.

In contrast, the runtime, and thus the energy consumption, of tensor networks is very dependent on the entanglement. If it is low, ITensor can drastically outperform QuEST despite featuring no parallelism. However, for highly entangled states the full simulation is very slow, though it can be approximated by truncating the bond dimensions.

In the future, tensor network simulations should be parallelised with OpenMP and MPI. They could then be directly compared against state vector methods to determine whether and when they are advantageous, especially in the case of an approximated state.

References

[1]
M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information, 10th ed. (Cambridge University Press, 2012).
[2]
F. Arute et al., Quantum Supremacy Using a Programmable Superconducting Processor, Nature 574, 505 (2019).
[3]
F. Pan and P. Zhang, Simulation of Quantum Circuits Using the Big-Batch Tensor Network Method, Physical Review Letters 128, (2022).
[4]
U. Schollwöck, The Density-Matrix Renormalization Group in the Age of Matrix Product States, Annals of Physics 326, 96 (2011).
[5]
Y. Zhou, E. M. Stoudenmire, and X. Waintal, What Limits the Simulation of Quantum Computers?, Physical Review X 10, (2020).
[6]
T. Jones, A. Brown, I. Bush, and S. C. Benjamin, QuEST and High Performance Simulation of Quantum Computers, Scientific Reports 9, (2019).
[7]
M. Fishman, S. White, and E. Stoudenmire, The ITensor Software Library for Tensor Network Calculations, SciPost Physics Codebases (2022).