As we enter the era in which quantum computing is becoming viable, it is especially important to focus on classical simulations of quantum systems. They can provide useful insight into the mechanisms behind quantum phenomena, and can be used to verify the results of the new technology. In addition, current quantum devices are often called Noisy Intermediate-Scale Quantum (NISQ) devices, due to the high amounts of noise that affect them. As a result, it may still be possible to keep classical simulations competitive, and to incentivise the development of both paradigms.
In general, simulating quantum circuits is an exponentially hard task, both in terms of runtime and memory. This is because a quantum state of \(n\) qubits is a superposition of all possible \(n\)-bit strings, each weighted by a complex amplitude (whose modulus squared equals the probability of measuring the corresponding string). Therefore, \(2^n\) complex numbers describe the state. However, many states can be separated into products of smaller states. If a state is separable into single-qubit states, it is non-entangled; otherwise, some degree of quantum entanglement is present, which manifests as correlations between the measurements of individual qubits. This is what differentiates quantum computing from its classical counterpart, since classically those measurements would be independent. There are methods for compressing states with less entanglement, which are described below. Notwithstanding such compression, quantum simulations remain very energy intensive.
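As an illustration (a hypothetical numpy sketch, not part of the benchmarked code), separability across a cut can be detected from the Schmidt rank, i.e. the number of non-zero singular values of the reshaped amplitude vector:

```python
import numpy as np

def schmidt_rank(state, n_left, n_right, tol=1e-12):
    """Schmidt rank of an (n_left + n_right)-qubit state across the cut."""
    mat = state.reshape(2**n_left, 2**n_right)
    s = np.linalg.svd(mat, compute_uv=False)
    return int(np.sum(s > tol))

# Product state |+>|0>: 4 amplitudes, but fully described by two 1-qubit states.
plus = np.array([1, 1]) / np.sqrt(2)
zero = np.array([1, 0])
product = np.kron(plus, zero)

# Bell state (|00> + |11>)/sqrt(2): cannot be factored into single-qubit states.
bell = np.array([1, 0, 0, 1]) / np.sqrt(2)

print(schmidt_rank(product, 1, 1))  # 1 -> separable
print(schmidt_rank(bell, 1, 1))     # 2 -> entangled
```

A rank of 1 means the measurements of the two halves are uncorrelated; any higher rank signals entanglement across that cut.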
Two different circuits were selected for benchmarking – the Quantum Fourier Transform (QFT) and a Random Quantum Circuit (RAND). Likewise, two platforms were used to create and run the simulations, each representing a distinct approach. The Quantum Exact Simulation Toolkit (QuEST) was used for full state vector simulations, while ITensor was leveraged to perform circuit simulations by contracting tensor networks of matrix product states (MPS) and matrix product operators (MPO).
Each approach allows different modifications to save energy. QuEST requires multiprocessing for larger simulations due to memory limitations, which makes it very heavy on inter-node communication. Therefore, it was found to be beneficial to slightly reduce the CPU clock frequency, so that less energy is wasted while cores wait for messages to arrive. On the other hand, ITensor is currently completely serial and single-threaded, but features a representation that makes compression easy for states with limited entanglement. Hence, it was found to be better for some algorithms and input states, despite its limitations.
The circuits and frameworks used in this work are presented below.
QFT is a very common component of other quantum algorithms. It performs the quantum analogue of the Discrete Fourier Transform on the input state, described by the following formula: \[ QFT_N \lvert x \rangle = \frac{1}{\sqrt{N}}\sum_{y=0}^{N-1} \omega_N^{xy} \lvert y \rangle \]
where \(N=2^n\) is the number of basis states, and \(\omega_N=e^{\frac{2\pi i}{N}}\). This can be re-expressed as a tensor product of single-qubit states: \[ QFT_N \lvert x \rangle = \frac{1}{\sqrt{N}}\bigotimes_{k=1}^n \left( \lvert 0 \rangle + \exp\left(\frac{2\pi ix}{2^k}\right) \lvert 1 \rangle \right) \]
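To make the equivalence concrete, a small numerical check (an illustrative sketch, not part of the benchmarked code) can compare the summation form against the product form for every basis input:

```python
import numpy as np

def qft_direct(x, n):
    # Summation form: amplitude omega_N^{xy} / sqrt(N) for every y.
    N = 2**n
    return np.exp(2j * np.pi * x * np.arange(N) / N) / np.sqrt(N)

def qft_product(x, n):
    # Product form: one unentangled factor per qubit, k = 1 .. n
    # (the 1/sqrt(N) prefactor is absorbed as 1/sqrt(2) per factor).
    state = np.array([1.0 + 0j])
    for k in range(1, n + 1):
        factor = np.array([1.0, np.exp(2j * np.pi * x / 2**k)]) / np.sqrt(2)
        state = np.kron(state, factor)
    return state

n = 5
for x in range(2**n):
    assert np.allclose(qft_direct(x, n), qft_product(x, n))
print("product form matches for all", 2**n, "basis inputs")
```

Since the product form is an explicit tensor product of single-qubit states, it confirms that the QFT of a basis state carries no entanglement.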
Therefore, if the input state is the basis state \(\lvert x \rangle\), its Quantum Fourier Transform will not introduce any new entanglement. However, if the input is a superposition of basis states \(\lvert \psi \rangle = \sum_{x=0}^{N-1} \psi_x \lvert x \rangle\), the corresponding transform will be a superposition of the transforms of each basis state. Some amplitudes may then cancel each other out and alter the entanglement of the result. We say that the QFT is weakly entangling: its output can have a different level of entanglement than its input, but it is unlikely to create an arbitrarily highly entangled state.
The circuit is easy to verify, as inputting the \(\vert 0 \rangle\) state should output an equal superposition of all basis states. It consists of \(\mathcal{O}(n)\) non-diagonal Hadamard gates and \(\mathcal{O}(n^2)\) diagonal controlled \(\theta\) phase shift gates. [1]
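This gate structure can be sketched directly on a dense statevector (a simplified numpy illustration; the benchmarks themselves used QuEST). The final qubit-reversal swaps are omitted here, which does not affect the \(\vert 0 \rangle\) check because the uniform superposition is symmetric:

```python
import numpy as np

def apply_1q(state, gate, q, n):
    # Apply a single-qubit gate to qubit q (qubit 0 = most significant).
    psi = np.moveaxis(state.reshape([2] * n), q, 0)
    return np.moveaxis(np.tensordot(gate, psi, axes=1), 0, q).reshape(-1)

def apply_cp(state, theta, c, t, n):
    # Diagonal controlled-phase gate: phase only where both bits are 1.
    idx = np.arange(2**n)
    both = ((idx >> (n - 1 - c)) & 1) & ((idx >> (n - 1 - t)) & 1)
    return state * np.where(both, np.exp(1j * theta), 1.0)

def qft_circuit(state, n):
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    for i in range(n):
        state = apply_1q(state, H, i, n)             # O(n) Hadamards
        for j in range(i + 1, n):                    # O(n^2) CP gates
            state = apply_cp(state, np.pi / 2**(j - i), j, i, n)
    return state  # (final qubit-reversal swaps omitted for this check)

n = 8
zero = np.zeros(2**n, dtype=complex); zero[0] = 1.0
out = qft_circuit(zero, n)
# QFT|0...0> is the equal superposition: every amplitude 1/sqrt(2^n).
assert np.allclose(out, np.full(2**n, 1 / np.sqrt(2**n)))
```

The diagonal CP gates only multiply amplitudes by phases, which is what makes them cheap in both representations discussed below.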
RAND is a simplified version of the circuit used in Google's quantum advantage claim. It is built of multiple layers that create entanglement. First, each qubit is acted on by a random gate from the set \(\{\sqrt{X}, \sqrt{Y}, \sqrt{W}\}\); neighbouring qubits are then entangled with controlled \(\frac{\pi}{2}\) phase shift gates. At least \(n\) layers are necessary to significantly entangle \(n\) qubits. The circuit has \(\mathcal{O}(n^2)\) random non-diagonal single-qubit gates, and likewise \(\mathcal{O}(n^2)\) diagonal two-qubit \(CP\) gates. [2]
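One such layer can be sketched as follows (a hedged illustration: the gate pattern is simplified from the description above, and the exact layout in Google's circuit differs):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
W = (X + Y) / np.sqrt(2)

def msqrt(U):
    # Principal square root of a unitary via eigendecomposition.
    w, v = np.linalg.eig(U)
    return v @ np.diag(np.sqrt(w)) @ np.linalg.inv(v)

GATES = [msqrt(X), msqrt(Y), msqrt(W)]

def apply_1q(state, gate, q, n):
    psi = np.moveaxis(state.reshape([2] * n), q, 0)
    return np.moveaxis(np.tensordot(gate, psi, axes=1), 0, q).reshape(-1)

def cp_layer(state, n, theta=np.pi / 2):
    # Diagonal controlled-phase gates on neighbouring pairs.
    idx = np.arange(2**n)
    for q in range(n - 1):
        both = ((idx >> (n - 1 - q)) & 1) & ((idx >> (n - 2 - q)) & 1)
        state = state * np.where(both, np.exp(1j * theta), 1.0)
    return state

def rand_layer(state, n, rng):
    for q in range(n):  # one random sqrt-gate per qubit
        state = apply_1q(state, GATES[rng.integers(3)], q, n)
    return cp_layer(state, n)

n, rng = 6, np.random.default_rng(0)
state = np.zeros(2**n, dtype=complex); state[0] = 1.0
for _ in range(n):  # ~n layers are needed to entangle n qubits
    state = rand_layer(state, n, rng)

# The evolution is unitary, and entanglement builds up across the middle cut.
s = np.linalg.svd(state.reshape(2**(n // 2), -1), compute_uv=False)
print(np.sum(s > 1e-12))  # Schmidt rank across the middle cut
```

After a few layers the Schmidt rank across any cut grows quickly, which is exactly what makes this circuit hard for the compressed representations discussed later.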
There are two main kinds of methods for solving quantum circuits. The first is based on a full vector representation of the state, and requires \(\mathcal{O}(2^n)\) memory for \(n\) qubits. It is possible to split the state into multiple parts and connect them via Feynman paths, but this increases the time complexity by a factor of 4 for each connection between parts. As a result, this algorithm is very expensive either in memory or in execution time. [3]
An alternative is to treat the circuit as a network of tensors. This approach is much more flexible, as the contractions between tensors can be performed in any order. In addition, such a structure is well suited to the immediate computation of required amplitudes, by appending a given output state to the network. This full contraction yields a scalar and, under an optimal contraction order, can be much faster than computing the whole statevector. However, most of the information is lost that way, and if too many amplitudes are needed, the runtime rapidly increases. This problem has been addressed in the literature, and many methods for computing batches of amplitudes have been developed. [3]
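A minimal two-qubit toy example (hypothetical, chosen only to illustrate the idea) shows how a single amplitude falls out of one full contraction, here \(\langle 00 \rvert CZ (H \otimes H) \lvert 00 \rangle\):

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CZ = np.diag([1.0, 1, 1, -1]).reshape(2, 2, 2, 2)  # indices: out0,out1,in0,in1

x0 = x1 = np.array([1.0, 0.0])  # input |0> on each qubit
y0 = y1 = np.array([1.0, 0.0])  # requested output bit <0| on each qubit

# One einsum call contracts inputs, gates and the appended output to a scalar.
amp = np.einsum('a,b,ca,db,efcd,e,f->', x0, x1, H, H, CZ, y0, y1)

# Cross-check against the Schrödinger (full statevector) approach.
full = CZ.reshape(4, 4) @ np.kron(H @ x0, H @ x1)
assert np.isclose(amp, full[0])
print(amp)  # ~0.5
```

Here the network is tiny, so the ordering does not matter; for large circuits, choosing the contraction order is precisely where the speedups (and the difficulty) lie.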
All of the above methods are exponential in some parameter. In principle, this is what makes quantum computers better than their classical counterparts. However, if we factor in noise, we can develop polynomial algorithms. For example, a statevector can be represented by a matrix product state (MPS), which stores the entanglement between qubits in singular values. If we limit the number of singular values, and truncate them as the circuit evolves, we emulate a kind of decoherence noise. This limits the size of the statevector and the number of computations to a polynomial order. [4,5]
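The truncation step can be sketched with plain SVDs (an illustrative implementation; function names are hypothetical, not from any particular library):

```python
import numpy as np

def to_mps(state, n, chi):
    # Sweep left to right: reshape, SVD, keep at most chi singular values.
    tensors, rest, bond = [], state.reshape(1, -1), 1
    for _ in range(n - 1):
        u, s, vh = np.linalg.svd(rest.reshape(bond * 2, -1),
                                 full_matrices=False)
        keep = min(chi, len(s))                  # the truncation step
        tensors.append(u[:, :keep].reshape(bond, 2, keep))
        rest, bond = s[:keep, None] * vh[:keep], keep
    tensors.append(rest.reshape(bond, 2, 1))
    return tensors

def from_mps(tensors):
    out = tensors[0]
    for t in tensors[1:]:
        out = np.tensordot(out, t, axes=([-1], [0]))
    return out.reshape(-1)

n = 8
# GHZ state: Schmidt rank 2 across every cut, so chi=2 reproduces it exactly.
ghz = np.zeros(2**n); ghz[0] = ghz[-1] = 1 / np.sqrt(2)
fid_ghz = abs(np.vdot(ghz, from_mps(to_mps(ghz, n, 2))))

# A random (highly entangled) state is damaged by the same chi=2 truncation.
rng = np.random.default_rng(0)
r = rng.normal(size=2**n); r /= np.linalg.norm(r)
fid_rand = abs(np.vdot(r, from_mps(to_mps(r, n, 2))))
print(fid_ghz, fid_rand)
```

The fidelity loss on the random state is the emulated decoherence: the discarded singular values carry exactly the entanglement the bond limit cannot store.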
The results were written by a SLURM script to an output file in CSV format with three columns – Nqubits, Ntasks and Time (in ms). In order to save on credits, most jobs were submitted for a fixed number of nodes, as otherwise some of the reserved resources would stay idle. The outputs of each series of computations were combined by hand.
The final raw data, transformed into a more readable form, are presented below. The \(N=2^k\) columns give the number of nodes, and the floating point values are timings in seconds. Some timings are unavailable due to memory limits and are marked NA.
QuEST timings:
Frequency | Circuit | Nqubits | N=1 | N=2 | N=4 | N=8 | N=16 | N=32 | N=64 | N=128 |
---|---|---|---|---|---|---|---|---|---|---|
Low | QFT | 32 | 137.8 | 89.2 | 50.4 | 30.1 | 17.3 | 9.8 | 5.4 | 2.4 |
Medium | QFT | 32 | 129.5 | 84.1 | 47.4 | 27.7 | 15.9 | 9.2 | 5.0 | 2.1 |
Highm1 | QFT | 32 | 129.4 | 84.4 | 47.4 | 28.4 | 16.3 | 9.0 | 5.0 | 2.1 |
High | QFT | 32 | 124.9 | 80.5 | 45.5 | 27.1 | 15.2 | 8.8 | 4.7 | 1.8 |
Low | QFT | 33 | 281.7 | 189.7 | 103.9 | 60.9 | 34.7 | 19.4 | 10.8 | 6.0 |
Medium | QFT | 33 | 272.5 | 179.7 | 98.5 | 57.6 | 32.9 | 18.6 | 10.2 | 5.5 |
Highm1 | QFT | 33 | 271.7 | 178.3 | 95.1 | 57.7 | 32.1 | 18.6 | 10.2 | 5.5 |
High | QFT | 33 | 251.2 | 169.1 | 93.3 | 53.7 | 31.4 | 17.3 | 9.7 | 5.1 |
Low | QFT | 34 | NA | NA | 216.1 | 125.4 | 70.6 | 39.8 | 21.8 | 12.2 |
Medium | QFT | 34 | NA | NA | 204.7 | 119.0 | 64.9 | 37.6 | 20.7 | 11.4 |
Highm1 | QFT | 34 | NA | NA | 203.6 | 118.1 | 65.4 | 37.5 | 20.6 | 11.4 |
High | QFT | 34 | NA | NA | 192.7 | 110.7 | 63.3 | 35.8 | 19.6 | 10.8 |
Low | QFT | 35 | NA | NA | NA | 265.2 | 141.3 | 80.1 | 44.4 | 24.4 |
Medium | QFT | 35 | NA | NA | NA | 243.5 | 135.1 | 75.7 | 42.0 | 23.0 |
Highm1 | QFT | 35 | NA | NA | NA | 244.1 | 134.8 | 75.7 | 42.1 | 23.3 |
High | QFT | 35 | NA | NA | NA | 230.8 | 128.3 | 72.4 | 39.9 | 21.8 |
Low | QFT | 36 | NA | NA | NA | NA | 295.2 | 164.1 | 90.6 | 49.5 |
Medium | QFT | 36 | NA | NA | NA | NA | 278.3 | 154.5 | 85.7 | 46.5 |
Highm1 | QFT | 36 | NA | NA | NA | NA | 279.0 | 154.6 | 85.2 | 46.4 |
High | QFT | 36 | NA | NA | NA | NA | 263.1 | 146.6 | 81.4 | 44.2 |
Low | RAND | 32 | 768.8 | 581.4 | 397.6 | 259.0 | 156.7 | 93.0 | 52.7 | 28.4 |
Medium | RAND | 32 | 743.0 | 551.7 | 374.9 | 242.0 | 146.9 | 86.7 | 48.9 | 25.9 |
Highm1 | RAND | 32 | 741.3 | 555.5 | 373.9 | 243.7 | 147.2 | 93.8 | 48.8 | 25.8 |
High | RAND | 32 | 712.9 | 528.0 | 354.9 | 229.1 | 139.1 | 81.9 | 46.0 | 24.1 |
Low | RAND | 33 | 1626.4 | 1224.5 | 814.8 | 526.4 | 324.1 | 191.8 | 110.7 | 61.7 |
Medium | RAND | 33 | 1572.8 | 1163.3 | 768.0 | 498.4 | 304.0 | 179.1 | 103.4 | 57.6 |
Highm1 | RAND | 33 | 1578.9 | 1165.3 | 765.2 | 495.6 | 305.2 | 179.4 | 103.4 | 57.5 |
High | RAND | 33 | 1523.7 | 1101.9 | 724.6 | 470.5 | 287.0 | 169.5 | 104.1 | 53.9 |
Low | RAND | 34 | NA | NA | 1665.3 | 1079.5 | 653.3 | 396.2 | 227.5 | 129.4 |
Medium | RAND | 34 | NA | NA | 1564.2 | 1011.3 | 613.2 | 370.5 | 231.2 | 120.9 |
Highm1 | RAND | 34 | NA | NA | 1559.6 | 1007.1 | 612.2 | 370.5 | 213.0 | 121.1 |
High | RAND | 34 | NA | NA | 1476.2 | 971.9 | 577.6 | 349.3 | 201.0 | 113.9 |
Low | RAND | 35 | NA | NA | NA | 2175.8 | 1322.8 | 803.5 | 473.6 | 277.2 |
Medium | RAND | 35 | NA | NA | NA | 2033.4 | 1231.9 | 750.7 | 443.9 | 249.5 |
Highm1 | RAND | 35 | NA | NA | NA | 2043.6 | 1237.9 | 750.5 | 443.9 | 248.6 |
High | RAND | 35 | NA | NA | NA | 1943.4 | 1158.4 | 708.0 | 417.9 | 234.0 |
Low | RAND | 36 | NA | NA | NA | NA | 2692.7 | 1619.6 | 979.4 | 552.7 |
Medium | RAND | 36 | NA | NA | NA | NA | 2511.6 | 1505.8 | 916.2 | 516.4 |
Highm1 | RAND | 36 | NA | NA | NA | NA | 2505.7 | 1510.4 | 917.4 | 515.8 |
High | RAND | 36 | NA | NA | NA | NA | 2349.9 | 1414.4 | 862.1 | 486.1 |
ITensor timings:
Circuit | Nqubits | InitState | MaxDim | Runtime | Energy | Experiment |
---|---|---|---|---|---|---|
BENCH | 12 | |0..0> | 256 | 0.805679 | 622 | RAND (trimmed) |
BENCH | 12 | |0..0> | 16777216 | 0.802858 | 694 | RAND (full) |
QFT | 12 | |0..0> | 16777216 | 0.321774 | 315 | QFT (0 input) |
QFT | 12 | |Wn> | 16777216 | 0.252076 | 276 | QFT (W input) |
BENCH | 14 | |0..0> | 256 | 1.377220 | 870 | RAND (trimmed) |
BENCH | 14 | |0..0> | 16777216 | 1.378290 | 886 | RAND (full) |
QFT | 14 | |0..0> | 16777216 | 0.398844 | 358 | QFT (0 input) |
QFT | 14 | |Wn> | 16777216 | 0.342208 | 345 | QFT (W input) |
BENCH | 16 | |0..0> | 256 | 2.885560 | 1320 | RAND (trimmed) |
BENCH | 16 | |0..0> | 16777216 | 2.807600 | 1150 | RAND (full) |
QFT | 16 | |0..0> | 16777216 | 0.482060 | 326 | QFT (0 input) |
QFT | 16 | |Wn> | 16777216 | 0.475178 | 294 | QFT (W input) |
BENCH | 18 | |0..0> | 256 | 7.969530 | 2820 | RAND (trimmed) |
BENCH | 18 | |0..0> | 16777216 | 10.585000 | 3360 | RAND (full) |
QFT | 18 | |0..0> | 16777216 | 0.595939 | 345 | QFT (0 input) |
QFT | 18 | |Wn> | 16777216 | 0.610556 | 360 | QFT (W input) |
BENCH | 20 | |0..0> | 256 | 17.117900 | 4740 | RAND (trimmed) |
BENCH | 20 | |0..0> | 16777216 | 53.225500 | 14380 | RAND (full) |
QFT | 20 | |0..0> | 16777216 | 0.682830 | 364 | QFT (0 input) |
QFT | 20 | |Wn> | 16777216 | 0.811472 | 346 | QFT (W input) |
BENCH | 22 | |0..0> | 256 | 30.693400 | 8260 | RAND (trimmed) |
BENCH | 22 | |0..0> | 16777216 | 331.654000 | 81070 | RAND (full) |
QFT | 22 | |0..0> | 16777216 | 0.869382 | 370 | QFT (0 input) |
QFT | 22 | |Wn> | 16777216 | 1.081210 | 549 | QFT (W input) |
BENCH | 24 | |0..0> | 256 | 47.553300 | 12580 | RAND (trimmed) |
BENCH | 24 | |0..0> | 16777216 | 1563.180000 | 379340 | RAND (full) |
QFT | 24 | |0..0> | 16777216 | 1.131030 | 480 | QFT (0 input) |
QFT | 24 | |Wn> | 16777216 | 1.358740 | 548 | QFT (W input) |
BENCH | 26 | |0..0> | 256 | 68.717700 | 17140 | RAND (trimmed) |
QFT | 26 | |0..0> | 16777216 | 1.384400 | 530 | QFT (0 input) |
QFT | 26 | |Wn> | 16777216 | 1.713760 | 621 | QFT (W input) |
BENCH | 28 | |0..0> | 256 | 94.232300 | 25920 | RAND (trimmed) |
QFT | 28 | |0..0> | 16777216 | 1.670910 | 607 | QFT (0 input) |
QFT | 28 | |Wn> | 16777216 | 2.097140 | 670 | QFT (W input) |
BENCH | 30 | |0..0> | 256 | 123.518000 | 31140 | RAND (trimmed) |
QFT | 30 | |0..0> | 16777216 | 1.966470 | 649 | QFT (0 input) |
QFT | 30 | |Wn> | 16777216 | 2.447190 | 851 | QFT (W input) |
BENCH | 32 | |0..0> | 256 | 157.958000 | 40530 | RAND (trimmed) |
QFT | 32 | |0..0> | 16777216 | 2.297540 | 724 | QFT (0 input) |
QFT | 32 | |Wn> | 16777216 | 2.878970 | 843 | QFT (W input) |
BENCH | 34 | |0..0> | 256 | 193.612000 | 57920 | RAND (trimmed) |
QFT | 34 | |0..0> | 16777216 | 2.702240 | 878 | QFT (0 input) |
QFT | 34 | |Wn> | 16777216 | 3.955050 | 1240 | QFT (W input) |
BENCH | 36 | |0..0> | 256 | 236.021000 | 72610 | RAND (trimmed) |
QFT | 36 | |0..0> | 16777216 | 3.230490 | 1120 | QFT (0 input) |
QFT | 36 | |Wn> | 16777216 | 5.529280 | 1610 | QFT (W input) |
BENCH | 38 | |0..0> | 256 | 279.868000 | 80810 | RAND (trimmed) |
QFT | 38 | |0..0> | 16777216 | 3.677020 | 1060 | QFT (0 input) |
QFT | 38 | |Wn> | 16777216 | 7.410140 | 2080 | QFT (W input) |
BENCH | 40 | |0..0> | 256 | 332.627000 | 93580 | RAND (trimmed) |
QFT | 40 | |0..0> | 16777216 | 4.413010 | 1280 | QFT (0 input) |
QFT | 40 | |Wn> | 16777216 | 10.517300 | 2540 | QFT (W input) |
BENCH | 42 | |0..0> | 256 | 402.321000 | 124050 | RAND (trimmed) |
QFT | 42 | |0..0> | 16777216 | 4.727110 | 1350 | QFT (0 input) |
QFT | 42 | |Wn> | 16777216 | 13.946100 | 3470 | QFT (W input) |
BENCH | 44 | |0..0> | 256 | 443.380000 | 126170 | RAND (trimmed) |
QFT | 44 | |0..0> | 16777216 | 5.732970 | 1670 | QFT (0 input) |
QFT | 44 | |Wn> | 16777216 | 18.032500 | 4530 | QFT (W input) |
BENCH | 46 | |0..0> | 256 | 502.356000 | 149490 | RAND (trimmed) |
QFT | 46 | |0..0> | 16777216 | 6.363610 | 1700 | QFT (0 input) |
QFT | 46 | |Wn> | 16777216 | 23.087000 | 5530 | QFT (W input) |
BENCH | 48 | |0..0> | 256 | 573.018000 | 172720 | RAND (trimmed) |
QFT | 48 | |0..0> | 16777216 | 7.198300 | 2090 | QFT (0 input) |
QFT | 48 | |Wn> | 16777216 | 28.961300 | 7470 | QFT (W input) |
BENCH | 50 | |0..0> | 256 | 650.325000 | 205200 | RAND (trimmed) |
QFT | 50 | |0..0> | 16777216 | 7.752420 | 1910 | QFT (0 input) |
QFT | 50 | |Wn> | 16777216 | 36.582000 | 8740 | QFT (W input) |
BENCH | 52 | |0..0> | 256 | 713.854000 | 220620 | RAND (trimmed) |
QFT | 52 | |0..0> | 16777216 | 9.426870 | 2410 | QFT (0 input) |
QFT | 52 | |Wn> | 16777216 | 45.634500 | 11640 | QFT (W input) |
BENCH | 54 | |0..0> | 256 | 794.648000 | 269100 | RAND (trimmed) |
QFT | 54 | |0..0> | 16777216 | 10.160100 | 2800 | QFT (0 input) |
QFT | 54 | |Wn> | 16777216 | 56.369100 | 13620 | QFT (W input) |
BENCH | 56 | |0..0> | 256 | 886.571000 | 291100 | RAND (trimmed) |
QFT | 56 | |0..0> | 16777216 | 11.736600 | 2720 | QFT (0 input) |
QFT | 56 | |Wn> | 16777216 | 69.114500 | 17390 | QFT (W input) |
BENCH | 58 | |0..0> | 256 | 962.006000 | 303530 | RAND (trimmed) |
QFT | 58 | |0..0> | 16777216 | 12.803100 | 3350 | QFT (0 input) |
QFT | 58 | |Wn> | 16777216 | 89.085500 | 21630 | QFT (W input) |
BENCH | 60 | |0..0> | 256 | 1059.990000 | 382820 | RAND (trimmed) |
QFT | 60 | |0..0> | 16777216 | 13.835500 | 3500 | QFT (0 input) |
QFT | 60 | |Wn> | 16777216 | 106.931000 | 24800 | QFT (W input) |
BENCH | 62 | |0..0> | 256 | 1154.630000 | 393440 | RAND (trimmed) |
QFT | 62 | |0..0> | 16777216 | 15.082800 | 3800 | QFT (0 input) |
QFT | 62 | |Wn> | 16777216 | 136.141000 | 33720 | QFT (W input) |
BENCH | 64 | |0..0> | 256 | 1274.880000 | 447260 | RAND (trimmed) |
QFT | 64 | |0..0> | 16777216 | 17.094300 | 4420 | QFT (0 input) |
QFT | 64 | |Wn> | 16777216 | 165.921000 | 40360 | QFT (W input) |
BENCH | 66 | |0..0> | 256 | 1358.580000 | 454060 | RAND (trimmed) |
QFT | 66 | |0..0> | 16777216 | 18.182900 | 4430 | QFT (0 input) |
QFT | 66 | |Wn> | 16777216 | 184.707000 | 44620 | QFT (W input) |
BENCH | 68 | |0..0> | 256 | 1474.950000 | 511090 | RAND (trimmed) |
QFT | 68 | |0..0> | 16777216 | 20.357600 | 5090 | QFT (0 input) |
QFT | 68 | |Wn> | 16777216 | 231.379000 | 56220 | QFT (W input) |
BENCH | 70 | |0..0> | 256 | 1579.340000 | 506360 | RAND (trimmed) |
QFT | 70 | |0..0> | 16777216 | 20.230600 | 4930 | QFT (0 input) |
QFT | 70 | |Wn> | 16777216 | 288.459000 | 69720 | QFT (W input) |
QFT | 72 | |0..0> | 16777216 | 24.923100 | 6400 | QFT (0 input) |
QFT | 72 | |Wn> | 16777216 | 346.768000 | 86960 | QFT (W input) |
QFT | 74 | |0..0> | 16777216 | 26.342600 | 6520 | QFT (0 input) |
QFT | 74 | |Wn> | 16777216 | 456.392000 | 109730 | QFT (W input) |
QFT | 76 | |0..0> | 16777216 | 27.884200 | 6910 | QFT (0 input) |
QFT | 76 | |Wn> | 16777216 | 596.460000 | 150470 | QFT (W input) |
QFT | 78 | |0..0> | 16777216 | 30.179600 | 7650 | QFT (0 input) |
QFT | 78 | |Wn> | 16777216 | 627.008000 | 151430 | QFT (W input) |
QFT | 80 | |0..0> | 16777216 | 32.779900 | 8300 | QFT (0 input) |
QFT | 80 | |Wn> | 16777216 | 817.609000 | 198300 | QFT (W input) |
QFT | 82 | |0..0> | 16777216 | 35.465700 | 8920 | QFT (0 input) |
QFT | 82 | |Wn> | 16777216 | 862.174000 | 194590 | QFT (W input) |
QFT | 84 | |0..0> | 16777216 | 37.657800 | 9220 | QFT (0 input) |
QFT | 84 | |Wn> | 16777216 | 952.616000 | 230260 | QFT (W input) |
QFT | 86 | |0..0> | 16777216 | 41.559900 | 10110 | QFT (0 input) |
QFT | 86 | |Wn> | 16777216 | 1050.530000 | 250290 | QFT (W input) |
QFT | 88 | |0..0> | 16777216 | 40.099200 | 10470 | QFT (0 input) |
QFT | 88 | |Wn> | 16777216 | 1197.190000 | 287880 | QFT (W input) |
QFT | 90 | |0..0> | 16777216 | 46.652300 | 11610 | QFT (0 input) |
QFT | 90 | |Wn> | 16777216 | 1371.380000 | 324020 | QFT (W input) |
Weak scaling was measured using the minimum number of ARCHER2 nodes required to fit the problem, starting from 1 node for 32 qubits and doubling for every additional qubit. Since parallel efficiency is poor, running on this minimum node count is clearly the most efficient approach. One exception is that 33 qubits can also fit on 1 node, as no MPI sendrecv buffer is needed there.
QFT performed better than RAND due to its prevalent diagonal gates. As predicted, there is a significant energy consumption gap between the high and medium CPU clock frequencies.
The clock frequency impact is shown for the same scaling as in the weak scaling figure, with runtime and energy displayed relative to the high frequency. A new highm1 frequency setting was added, which lies above medium.
It is clear that the runtime penalty of decreasing the frequency is smaller than the gain in energy efficiency, with the medium setting being optimal. Reducing the frequency is therefore a viable strategy for greener simulations, adding only a few seconds of runtime cost.
Tensor networks allow an efficient state vector representation via a Matrix Product State (MPS). Instead of storing \(2^n\) amplitudes, each qubit is represented by a site, which is connected to others with contractable bonds of dimensions that correspond to the entanglement of the state:
The platform used for the experiments is ITensor. It does not support any parallelism for general tensor networks, but even under those limitations it can outperform QuEST for some circuits.
QFT is a circuit that doesn’t induce much entanglement, so it can be simulated very fast. It was fed with a trivial \(\vert 00...0 \rangle\) state, and a lightly entangled \(\vert W \rangle\) state defined as:
\[\vert W \rangle = \frac{1}{\sqrt n} (\vert 100...0 \rangle + \vert 010...0 \rangle + ... + \vert 000...1 \rangle)\]
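A quick numerical check (an illustrative sketch) shows why \(\vert W \rangle\) counts as lightly entangled: across every bipartition the left part either contains the single excitation or it does not, so the Schmidt rank is 2 and an MPS with bond dimension 2 stores the state exactly:

```python
import numpy as np

n = 10
w = np.zeros(2**n)
w[[2**k for k in range(n)]] = 1 / np.sqrt(n)  # one excitation per term

for cut in range(1, n):
    mat = w.reshape(2**cut, 2**(n - cut))
    s = np.linalg.svd(mat, compute_uv=False)
    # Exactly two non-zero Schmidt values across every cut.
    assert np.sum(s > 1e-12) == 2
print("bond dimension 2 suffices at every cut")
```

This is why the QFT on the \(\vert W \rangle\) input stays cheap for ITensor even at large qubit counts.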
RAND is highly entangling, which makes full-state simulation difficult. Promising results were achieved by truncating the bond dimensions; in the figure above, a maximum bond dimension of 256 was used for the trimmed state. However, its accuracy quickly degrades as the number of qubits grows.
So far, we have only tested the Schrödinger and MPS algorithms thoroughly. We used QuEST for state vector simulation, as on an HPC system it proved much faster than Qiskit. It offers parallelisation across \(2^p\) processes, and ARCHER2 was able to simulate up to 43 qubits on 2,048 nodes. Interestingly, downclocking the CPU can save much energy with little time penalty. We are also writing our own simulator to remove the \(2^p\)-process restriction, which we found leaves some nodes underused. [6]
Tensor networks were tested in ITensor, which offers no message-passing parallelism, but even without it can be much faster than the Schrödinger method. As long as the entanglement in the system is low and keeps being truncated (which acts similarly to noise), the runtime shows polynomial complexity. However, if we want a noiseless statevector, it performs much worse than the Schrödinger method. [7]
The state vector approach is stable regardless of the entanglement. Due to its high communication costs, it is most economical when run on the minimum number of nodes that fits the problem, with the CPU clock frequency reduced to the medium setting.
In contrast, the runtime, and thus the energy consumption, of tensor networks depends strongly on the entanglement. If it is low, ITensor can drastically outperform QuEST despite featuring no parallelism. However, for highly entangled states the full simulation is very slow, although it can be approximated by truncating the bond dimensions.
In the future, tensor network simulations should be parallelised with OpenMP and MPI. They could then be directly compared against state vector methods to determine whether and when they are advantageous, especially in the case of an approximate state.