As we enter the era in which quantum computing is becoming viable, it is especially important to focus on classical simulations of quantum systems. They can provide useful insight into the mechanisms behind quantum phenomena, and can be used to verify the results of the new technology. In addition, current quantum devices are often called Noisy Intermediate-Scale Quantum (NISQ) devices, due to the high amounts of noise that affect them. As a result, it may still be possible to keep classical simulations competitive, and to incentivise the development of both paradigms.
In general, simulating quantum circuits is an exponentially hard task, both in terms of runtime and memory. This is because a quantum state of \(n\) qubits is a superposition of all possible \(n\)-bit strings, each weighted by a complex amplitude (whose modulus squared equals the probability of measuring the corresponding string). Therefore, \(2^n\) complex numbers describe the state. However, many states can be separated into products of smaller states. If a state is separable into single-qubit states, it is non-entangled; otherwise, some degree of quantum entanglement is present, which manifests as correlations between the measurements of individual qubits. This is what differentiates quantum computing from its classical counterpart, since classically those measurements would be independent. There are methods for compressing states with less entanglement, which are described below. Notwithstanding such compression, quantum simulations remain very energy intensive.
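As an illustration (a hypothetical numpy sketch, not part of the benchmarked code), separability across a cut can be detected from the Schmidt rank, i.e. the number of non-zero singular values of the reshaped amplitude vector:

```python
import numpy as np

def schmidt_rank(state, n_left, n_right, tol=1e-12):
    """Schmidt rank of an (n_left + n_right)-qubit state across the cut."""
    mat = state.reshape(2**n_left, 2**n_right)
    s = np.linalg.svd(mat, compute_uv=False)
    return int(np.sum(s > tol))

# Product state |+>|0>: 4 amplitudes, but fully described by two 1-qubit states.
plus = np.array([1, 1]) / np.sqrt(2)
zero = np.array([1, 0])
product = np.kron(plus, zero)

# Bell state (|00> + |11>)/sqrt(2): cannot be factored into single-qubit states.
bell = np.array([1, 0, 0, 1]) / np.sqrt(2)

print(schmidt_rank(product, 1, 1))  # 1 -> separable
print(schmidt_rank(bell, 1, 1))     # 2 -> entangled
```

A rank of 1 means the measurements of the two halves are uncorrelated; any higher rank signals entanglement across that cut.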
Two different circuits were selected for benchmarking – the Quantum Fourier Transform (QFT) and a Random Quantum Circuit (RAND). Likewise, two platforms were used to create and run the simulations, each representing a distinct approach. The Quantum Exact Simulation Toolkit (QuEST) was used for full state vector simulations, while ITensor was leveraged to perform circuit simulations by contracting tensor networks of matrix product states (MPS) and matrix product operators (MPO).
Each approach allows different modifications to save energy. QuEST requires multiprocessing for larger simulations due to memory limitations, which makes it very heavy on inter-node communication. Therefore, it was found to be beneficial to slightly reduce the CPU clock frequency, so that less energy is wasted while cores wait for messages to arrive. On the other hand, ITensor is currently completely serial and single-threaded, but features a representation that makes compression easy for states with limited entanglement. Hence, it was found to be better for some algorithms and input states, despite its limitations.
The circuits and frameworks used in this work are presented below.
QFT is a very common component of other quantum algorithms. It performs the quantum analogue of the Discrete Fourier Transform on the input state, described by the following formula: \[ QFT_N \lvert x \rangle = \frac{1}{\sqrt{N}}\sum_{y=0}^{N-1} \omega_N^{xy} \lvert y \rangle \]
where \(N=2^n\) is the number of basis states, and \(\omega_N=e^{\frac{2\pi i}{N}}\). This can be re-expressed as a tensor product of single-qubit states: \[ QFT_N \lvert x \rangle = \frac{1}{\sqrt{N}}\bigotimes_{k=1}^n \left( \lvert 0 \rangle + \exp\left(\frac{2\pi ix}{2^k}\right) \lvert 1 \rangle \right) \]
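To make the equivalence concrete, a small numerical check (an illustrative sketch, not part of the benchmarked code) can compare the summation form against the product form for every basis input:

```python
import numpy as np

def qft_direct(x, n):
    # Summation form: amplitude omega_N^{xy} / sqrt(N) for every y.
    N = 2**n
    return np.exp(2j * np.pi * x * np.arange(N) / N) / np.sqrt(N)

def qft_product(x, n):
    # Product form: one unentangled factor per qubit, k = 1 .. n
    # (the 1/sqrt(N) prefactor is absorbed as 1/sqrt(2) per factor).
    state = np.array([1.0 + 0j])
    for k in range(1, n + 1):
        factor = np.array([1.0, np.exp(2j * np.pi * x / 2**k)]) / np.sqrt(2)
        state = np.kron(state, factor)
    return state

n = 5
for x in range(2**n):
    assert np.allclose(qft_direct(x, n), qft_product(x, n))
print("product form matches for all", 2**n, "basis inputs")
```

Since the product form is an explicit tensor product of single-qubit states, it confirms that the QFT of a basis state carries no entanglement.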
Therefore, if the input state is the basis state \(\lvert x \rangle\), its Quantum Fourier Transform will not introduce any new entanglement. However, if the input is a superposition of basis states \(\lvert \psi \rangle = \sum_{x=0}^{N-1} \psi_x \lvert x \rangle\), the corresponding transform will be a superposition of the transforms of each basis state. Some amplitudes may then cancel each other out and alter the entanglement of the result. We say that the QFT is weakly entangling: its output can have a different level of entanglement than its input, but it is unlikely to create an arbitrarily highly entangled state.
The circuit is easy to verify, as inputting the \(\vert 0 \rangle\) state should output an equal superposition of all basis states. It consists of \(\mathcal{O}(n)\) non-diagonal Hadamard gates and \(\mathcal{O}(n^2)\) diagonal controlled \(\theta\) phase shift gates. [1]
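This gate structure can be sketched directly on a dense statevector (a simplified numpy illustration; the benchmarks themselves used QuEST). The final qubit-reversal swaps are omitted here, which does not affect the \(\vert 0 \rangle\) check because the uniform superposition is symmetric:

```python
import numpy as np

def apply_1q(state, gate, q, n):
    # Apply a single-qubit gate to qubit q (qubit 0 = most significant).
    psi = np.moveaxis(state.reshape([2] * n), q, 0)
    return np.moveaxis(np.tensordot(gate, psi, axes=1), 0, q).reshape(-1)

def apply_cp(state, theta, c, t, n):
    # Diagonal controlled-phase gate: phase only where both bits are 1.
    idx = np.arange(2**n)
    both = ((idx >> (n - 1 - c)) & 1) & ((idx >> (n - 1 - t)) & 1)
    return state * np.where(both, np.exp(1j * theta), 1.0)

def qft_circuit(state, n):
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    for i in range(n):
        state = apply_1q(state, H, i, n)             # O(n) Hadamards
        for j in range(i + 1, n):                    # O(n^2) CP gates
            state = apply_cp(state, np.pi / 2**(j - i), j, i, n)
    return state  # (final qubit-reversal swaps omitted for this check)

n = 8
zero = np.zeros(2**n, dtype=complex); zero[0] = 1.0
out = qft_circuit(zero, n)
# QFT|0...0> is the equal superposition: every amplitude 1/sqrt(2^n).
assert np.allclose(out, np.full(2**n, 1 / np.sqrt(2**n)))
```

The diagonal CP gates only multiply amplitudes by phases, which is what makes them cheap in both representations discussed below.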
RAND is a simplified version of the circuit used in Google's quantum advantage claim. It is built of multiple layers that create entanglement. First, each qubit is acted on by a random gate from the set \(\{\sqrt{X}, \sqrt{Y}, \sqrt{W}\}\); neighbouring qubits are then entangled with controlled \(\frac{\pi}{2}\) phase shift gates. At least \(n\) layers are necessary to significantly entangle \(n\) qubits. The circuit has \(\mathcal{O}(n^2)\) random non-diagonal single-qubit gates, and likewise \(\mathcal{O}(n^2)\) diagonal two-qubit \(CP\) gates. [2]
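One such layer can be sketched as follows (a hedged illustration: the gate pattern is simplified from the description above, and the exact layout in Google's circuit differs):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
W = (X + Y) / np.sqrt(2)

def msqrt(U):
    # Principal square root of a unitary via eigendecomposition.
    w, v = np.linalg.eig(U)
    return v @ np.diag(np.sqrt(w)) @ np.linalg.inv(v)

GATES = [msqrt(X), msqrt(Y), msqrt(W)]

def apply_1q(state, gate, q, n):
    psi = np.moveaxis(state.reshape([2] * n), q, 0)
    return np.moveaxis(np.tensordot(gate, psi, axes=1), 0, q).reshape(-1)

def cp_layer(state, n, theta=np.pi / 2):
    # Diagonal controlled-phase gates on neighbouring pairs.
    idx = np.arange(2**n)
    for q in range(n - 1):
        both = ((idx >> (n - 1 - q)) & 1) & ((idx >> (n - 2 - q)) & 1)
        state = state * np.where(both, np.exp(1j * theta), 1.0)
    return state

def rand_layer(state, n, rng):
    for q in range(n):  # one random sqrt-gate per qubit
        state = apply_1q(state, GATES[rng.integers(3)], q, n)
    return cp_layer(state, n)

n, rng = 6, np.random.default_rng(0)
state = np.zeros(2**n, dtype=complex); state[0] = 1.0
for _ in range(n):  # ~n layers are needed to entangle n qubits
    state = rand_layer(state, n, rng)

# The evolution is unitary, and entanglement builds up across the middle cut.
s = np.linalg.svd(state.reshape(2**(n // 2), -1), compute_uv=False)
print(np.sum(s > 1e-12))  # Schmidt rank across the middle cut
```

After a few layers the Schmidt rank across any cut grows quickly, which is exactly what makes this circuit hard for the compressed representations discussed later.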
There are two main kinds of methods for solving quantum circuits. The first is based on a full vector representation of the state, and requires \(\mathcal{O}(2^n)\) memory for \(n\) qubits. It is possible to split the state into multiple parts and connect them via Feynman paths, but this increases the time complexity by a factor of 4 for each connection between parts. As a result, this algorithm is very expensive either in memory or in execution time. [3]
An alternative is to treat the circuit as a network of tensors. This approach is much more flexible, as the contractions between tensors can be performed in any order. In addition, such a structure is well suited to the immediate computation of required amplitudes, by appending a given output state to the network. This full contraction yields a scalar and, under an optimal contraction order, can be much faster than computing the whole statevector. However, most of the information is lost that way, and if too many amplitudes are needed, the runtime rapidly increases. This problem has been addressed in the literature, and many methods for computing batches of amplitudes have been developed. [3]
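A minimal two-qubit toy example (hypothetical, chosen only to illustrate the idea) shows how a single amplitude falls out of one full contraction, here \(\langle 00 \rvert CZ (H \otimes H) \lvert 00 \rangle\):

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CZ = np.diag([1.0, 1, 1, -1]).reshape(2, 2, 2, 2)  # indices: out0,out1,in0,in1

x0 = x1 = np.array([1.0, 0.0])  # input |0> on each qubit
y0 = y1 = np.array([1.0, 0.0])  # requested output bit <0| on each qubit

# One einsum call contracts inputs, gates and the appended output to a scalar.
amp = np.einsum('a,b,ca,db,efcd,e,f->', x0, x1, H, H, CZ, y0, y1)

# Cross-check against the Schrödinger (full statevector) approach.
full = CZ.reshape(4, 4) @ np.kron(H @ x0, H @ x1)
assert np.isclose(amp, full[0])
print(amp)  # ~0.5
```

Here the network is tiny, so the ordering does not matter; for large circuits, choosing the contraction order is precisely where the speedups (and the difficulty) lie.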
All of the above methods are exponential in some parameter. In principle, this is what makes quantum computers better than their classical counterparts. However, if we factor in noise, we can develop polynomial algorithms. For example, a statevector can be represented by a matrix product state (MPS), which stores the entanglement between qubits in singular values. If we limit the number of singular values, and truncate them as the circuit evolves, we emulate a kind of decoherence noise. This limits the size of the statevector and the number of computations to a polynomial order. [4,5]
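The truncation step can be sketched with plain SVDs (an illustrative implementation; function names are hypothetical, not from any particular library):

```python
import numpy as np

def to_mps(state, n, chi):
    # Sweep left to right: reshape, SVD, keep at most chi singular values.
    tensors, rest, bond = [], state.reshape(1, -1), 1
    for _ in range(n - 1):
        u, s, vh = np.linalg.svd(rest.reshape(bond * 2, -1),
                                 full_matrices=False)
        keep = min(chi, len(s))                  # the truncation step
        tensors.append(u[:, :keep].reshape(bond, 2, keep))
        rest, bond = s[:keep, None] * vh[:keep], keep
    tensors.append(rest.reshape(bond, 2, 1))
    return tensors

def from_mps(tensors):
    out = tensors[0]
    for t in tensors[1:]:
        out = np.tensordot(out, t, axes=([-1], [0]))
    return out.reshape(-1)

n = 8
# GHZ state: Schmidt rank 2 across every cut, so chi=2 reproduces it exactly.
ghz = np.zeros(2**n); ghz[0] = ghz[-1] = 1 / np.sqrt(2)
fid_ghz = abs(np.vdot(ghz, from_mps(to_mps(ghz, n, 2))))

# A random (highly entangled) state is damaged by the same chi=2 truncation.
rng = np.random.default_rng(0)
r = rng.normal(size=2**n); r /= np.linalg.norm(r)
fid_rand = abs(np.vdot(r, from_mps(to_mps(r, n, 2))))
print(fid_ghz, fid_rand)
```

The fidelity loss on the random state is the emulated decoherence: the discarded singular values carry exactly the entanglement the bond limit cannot store.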
The results were written by a SLURM script to an output file in CSV format with three columns – Nqubits, Ntasks and Time (in ms). In order to save on credits, most jobs were submitted for a fixed number of nodes, as otherwise some of the reserved resources would stay idle. The outputs of each series of computations were combined by hand.
The final raw data, transformed into a more readable form, are presented below. The \(N=2^k\) columns give the number of nodes, and the floating point values are timings in seconds. Some timings are unavailable due to memory limits and are marked NA.
QuEST timings:
Frequency | Circuit | Nqubits | N=1 | N=2 | N=4 | N=8 | N=16 | N=32 | N=64 | N=128 |
---|---|---|---|---|---|---|---|---|---|---|
Low | QFT | 32 | 137.8 | 89.2 | 50.4 | 30.1 | 17.3 | 9.8 | 5.4 | 2.4 |
Medium | QFT | 32 | 129.5 | 84.1 | 47.4 | 27.7 | 15.9 | 9.2 | 5.0 | 2.1 |
Highm1 | QFT | 32 | 129.4 | 84.4 | 47.4 | 28.4 | 16.3 | 9.0 | 5.0 | 2.1 |
High | QFT | 32 | 124.9 | 80.5 | 45.5 | 27.1 | 15.2 | 8.8 | 4.7 | 1.8 |
Low | QFT | 33 | 281.7 | 189.7 | 103.9 | 60.9 | 34.7 | 19.4 | 10.8 | 6.0 |
Medium | QFT | 33 | 272.5 | 179.7 | 98.5 | 57.6 | 32.9 | 18.6 | 10.2 | 5.5 |
Highm1 | QFT | 33 | 271.7 | 178.3 | 95.1 | 57.7 | 32.1 | 18.6 | 10.2 | 5.5 |
High | QFT | 33 | 251.2 | 169.1 | 93.3 | 53.7 | 31.4 | 17.3 | 9.7 | 5.1 |
Low | QFT | 34 | NA | NA | 216.1 | 125.4 | 70.6 | 39.8 | 21.8 | 12.2 |
Medium | QFT | 34 | NA | NA | 204.7 | 119.0 | 64.9 | 37.6 | 20.7 | 11.4 |
Highm1 | QFT | 34 | NA | NA | 203.6 | 118.1 | 65.4 | 37.5 | 20.6 | 11.4 |
High | QFT | 34 | NA | NA | 192.7 | 110.7 | 63.3 | 35.8 | 19.6 | 10.8 |
Low | QFT | 35 | NA | NA | NA | 265.2 | 141.3 | 80.1 | 44.4 | 24.4 |
Medium | QFT | 35 | NA | NA | NA | 243.5 | 135.1 | 75.7 | 42.0 | 23.0 |
Highm1 | QFT | 35 | NA | NA | NA | 244.1 | 134.8 | 75.7 | 42.1 | 23.3 |
High | QFT | 35 | NA | NA | NA | 230.8 | 128.3 | 72.4 | 39.9 | 21.8 |
Low | QFT | 36 | NA | NA | NA | NA | 295.2 | 164.1 | 90.6 | 49.5 |
Medium | QFT | 36 | NA | NA | NA | NA | 278.3 | 154.5 | 85.7 | 46.5 |
Highm1 | QFT | 36 | NA | NA | NA | NA | 279.0 | 154.6 | 85.2 | 46.4 |
High | QFT | 36 | NA | NA | NA | NA | 263.1 | 146.6 | 81.4 | 44.2 |
Low | RAND | 32 | 768.8 | 581.4 | 397.6 | 259.0 | 156.7 | 93.0 | 52.7 | 28.4 |
Medium | RAND | 32 | 743.0 | 551.7 | 374.9 | 242.0 | 146.9 | 86.7 | 48.9 | 25.9 |
Highm1 | RAND | 32 | 741.3 | 555.5 | 373.9 | 243.7 | 147.2 | 93.8 | 48.8 | 25.8 |
High | RAND | 32 | 712.9 | 528.0 | 354.9 | 229.1 | 139.1 | 81.9 | 46.0 | 24.1 |
Low | RAND | 33 | 1626.4 | 1224.5 | 814.8 | 526.4 | 324.1 | 191.8 | 110.7 | 61.7 |
Medium | RAND | 33 | 1572.8 | 1163.3 | 768.0 | 498.4 | 304.0 | 179.1 | 103.4 | 57.6 |
Highm1 | RAND | 33 | 1578.9 | 1165.3 | 765.2 | 495.6 | 305.2 | 179.4 | 103.4 | 57.5 |
High | RAND | 33 | 1523.7 | 1101.9 | 724.6 | 470.5 | 287.0 | 169.5 | 104.1 | 53.9 |
Low | RAND | 34 | NA | NA | 1665.3 | 1079.5 | 653.3 | 396.2 | 227.5 | 129.4 |
Medium | RAND | 34 | NA | NA | 1564.2 | 1011.3 | 613.2 | 370.5 | 231.2 | 120.9 |
Highm1 | RAND | 34 | NA | NA | 1559.6 | 1007.1 | 612.2 | 370.5 | 213.0 | 121.1 |
High | RAND | 34 | NA | NA | 1476.2 | 971.9 | 577.6 | 349.3 | 201.0 | 113.9 |
Low | RAND | 35 | NA | NA | NA | 2175.8 | 1322.8 | 803.5 | 473.6 | 277.2 |
Medium | RAND | 35 | NA | NA | NA | 2033.4 | 1231.9 | 750.7 | 443.9 | 249.5 |
Highm1 | RAND | 35 | NA | NA | NA | 2043.6 | 1237.9 | 750.5 | 443.9 | 248.6 |
High | RAND | 35 | NA | NA | NA | 1943.4 | 1158.4 | 708.0 | 417.9 | 234.0 |
Low | RAND | 36 | NA | NA | NA | NA | 2692.7 | 1619.6 | 979.4 | 552.7 |
Medium | RAND | 36 | NA | NA | NA | NA | 2511.6 | 1505.8 | 916.2 | 516.4 |
Highm1 | RAND | 36 | NA | NA | NA | NA | 2505.7 | 1510.4 | 917.4 | 515.8 |
High | RAND | 36 | NA | NA | NA | NA | 2349.9 | 1414.4 | 862.1 | 486.1 |
ITensor timings:
Circuit | Nqubits | InitState | MaxDim | Runtime | Energy | Experiment |
---|---|---|---|---|---|---|
BENCH | 12 | |0..0> | 256 | 0.805679 | 622 | RAND (trimmed) |
BENCH | 12 | |0..0> | 16777216 | 0.802858 | 694 | RAND (full) |
QFT | 12 | |0..0> | 16777216 | 0.321774 | 315 | QFT (0 input) |
QFT | 12 | |Wn> | 16777216 | 0.252076 | 276 | QFT (W input) |
BENCH | 14 | |0..0> | 256 | 1.377220 | 870 | RAND (trimmed) |
BENCH | 14 | |0..0> | 16777216 | 1.378290 | 886 | RAND (full) |
QFT | 14 | |0..0> | 16777216 | 0.398844 | 358 | QFT (0 input) |
QFT | 14 | |Wn> | 16777216 | 0.342208 | 345 | QFT (W input) |
BENCH | 16 | |0..0> | 256 | 2.885560 | 1320 | RAND (trimmed) |
BENCH | 16 | |0..0> | 16777216 | 2.807600 | 1150 | RAND (full) |
QFT | 16 | |0..0> | 16777216 | 0.482060 | 326 | QFT (0 input) |
QFT | 16 | |Wn> | 16777216 | 0.475178 | 294 | QFT (W input) |
BENCH | 18 | |0..0> | 256 | 7.969530 | 2820 | RAND (trimmed) |
BENCH | 18 | |0..0> | 16777216 | 10.585000 | 3360 | RAND (full) |
QFT | 18 | |0..0> | 16777216 | 0.595939 | 345 | QFT (0 input) |
QFT | 18 | |Wn> | 16777216 | 0.610556 | 360 | QFT (W input) |
BENCH | 20 | |0..0> | 256 | 17.117900 | 4740 | RAND (trimmed) |
BENCH | 20 | |0..0> | 16777216 | 53.225500 | 14380 | RAND (full) |
QFT | 20 | |0..0> | 16777216 | 0.682830 | 364 | QFT (0 input) |
QFT | 20 | |Wn> | 16777216 | 0.811472 | 346 | QFT (W input) |
BENCH | 22 | |0..0> | 256 | 30.693400 | 8260 | RAND (trimmed) |
BENCH | 22 | |0..0> | 16777216 | 331.654000 | 81070 | RAND (full) |
QFT | 22 | |0..0> | 16777216 | 0.869382 | 370 | QFT (0 input) |
QFT | 22 | |Wn> | 16777216 | 1.081210 | 549 | QFT (W input) |
BENCH | 24 | |0..0> | 256 | 47.553300 | 12580 | RAND (trimmed) |
BENCH | 24 | |0..0> | 16777216 | 1563.180000 | 379340 | RAND (full) |
QFT | 24 | |0..0> | 16777216 | 1.131030 | 480 | QFT (0 input) |
QFT | 24 | |Wn> | 16777216 | 1.358740 | 548 | QFT (W input) |
BENCH | 26 | |0..0> | 256 | 68.717700 | 17140 | RAND (trimmed) |
QFT | 26 | |0..0> | 16777216 | 1.384400 | 530 | QFT (0 input) |
QFT | 26 | |Wn> | 16777216 | 1.713760 | 621 | QFT (W input) |
BENCH | 28 | |0..0> | 256 | 94.232300 | 25920 | RAND (trimmed) |
QFT | 28 | |0..0> | 16777216 | 1.670910 | 607 | QFT (0 input) |
QFT | 28 | |Wn> | 16777216 | 2.097140 | 670 | QFT (W input) |
BENCH | 30 | |0..0> | 256 | 123.518000 | 31140 | RAND (trimmed) |
QFT | 30 | |0..0> | 16777216 | 1.966470 | 649 | QFT (0 input) |
QFT | 30 | |Wn> | 16777216 | 2.447190 | 851 | QFT (W input) |
BENCH | 32 | |0..0> | 256 | 157.958000 | 40530 | RAND (trimmed) |
QFT | 32 | |0..0> | 16777216 | 2.297540 | 724 | QFT (0 input) |
QFT | 32 | |Wn> | 16777216 | 2.878970 | 843 | QFT (W input) |
BENCH | 34 | |0..0> | 256 | 193.612000 | 57920 | RAND (trimmed) |
QFT | 34 | |0..0> | 16777216 | 2.702240 | 878 | QFT (0 input) |
QFT | 34 | |Wn> | 16777216 | 3.955050 | 1240 | QFT (W input) |
BENCH | 36 | |0..0> | 256 | 236.021000 | 72610 | RAND (trimmed) |
QFT | 36 | |0..0> | 16777216 | 3.230490 | 1120 | QFT (0 input) |
QFT | 36 | |Wn> | 16777216 | 5.529280 | 1610 | QFT (W input) |
BENCH | 38 | |0..0> | 256 | 279.868000 | 80810 | RAND (trimmed) |
QFT | 38 | |0..0> | 16777216 | 3.677020 | 1060 | QFT (0 input) |
QFT | 38 | |Wn> | 16777216 | 7.410140 | 2080 | QFT (W input) |
BENCH | 40 | |0..0> | 256 | 332.627000 | 93580 | RAND (trimmed) |
QFT | 40 | |0..0> | 16777216 | 4.413010 | 1280 | QFT (0 input) |
QFT | 40 | |Wn> | 16777216 | 10.517300 | 2540 | QFT (W input) |
BENCH | 42 | |0..0> | 256 | 402.321000 | 124050 | RAND (trimmed) |
QFT | 42 | |0..0> | 16777216 | 4.727110 | 1350 | QFT (0 input) |
QFT | 42 | |Wn> | 16777216 | 13.946100 | 3470 | QFT (W input) |
BENCH | 44 | |0..0> | 256 | 443.380000 | 126170 | RAND (trimmed) |
QFT | 44 | |0..0> | 16777216 | 5.732970 | 1670 | QFT (0 input) |
QFT | 44 | |Wn> | 16777216 | 18.032500 | 4530 | QFT (W input) |
BENCH | 46 | |0..0> | 256 | 502.356000 | 149490 | RAND (trimmed) |
QFT | 46 | |0..0> | 16777216 | 6.363610 | 1700 | QFT (0 input) |
QFT | 46 | |Wn> | 16777216 | 23.087000 | 5530 | QFT (W input) |
BENCH | 48 | |0..0> | 256 | 573.018000 | 172720 | RAND (trimmed) |
QFT | 48 | |0..0> | 16777216 | 7.198300 | 2090 | QFT (0 input) |
QFT | 48 | |Wn> | 16777216 | 28.961300 | 7470 | QFT (W input) |
BENCH | 50 | |0..0> | 256 | 650.325000 | 205200 | RAND (trimmed) |
QFT | 50 | |0..0> | 16777216 | 7.752420 | 1910 | QFT (0 input) |
QFT | 50 | |Wn> | 16777216 | 36.582000 | 8740 | QFT (W input) |
BENCH | 52 | |0..0> | 256 | 713.854000 | 220620 | RAND (trimmed) |
QFT | 52 | |0..0> | 16777216 | 9.426870 | 2410 | QFT (0 input) |
QFT | 52 | |Wn> | 16777216 | 45.634500 | 11640 | QFT (W input) |
BENCH | 54 | |0..0> | 256 | 794.648000 | 269100 | RAND (trimmed) |
QFT | 54 | |0..0> | 16777216 | 10.160100 | 2800 | QFT (0 input) |
QFT | 54 | |Wn> | 16777216 | 56.369100 | 13620 | QFT (W input) |
BENCH | 56 | |0..0> | 256 | 886.571000 | 291100 | RAND (trimmed) |
QFT | 56 | |0..0> | 16777216 | 11.736600 | 2720 | QFT (0 input) |
QFT | 56 | |Wn> | 16777216 | 69.114500 | 17390 | QFT (W input) |
BENCH | 58 | |0..0> | 256 | 962.006000 | 303530 | RAND (trimmed) |
QFT | 58 | |0..0> | 16777216 | 12.803100 | 3350 | QFT (0 input) |
QFT | 58 | |Wn> | 16777216 | 89.085500 | 21630 | QFT (W input) |
BENCH | 60 | |0..0> | 256 | 1059.990000 | 382820 | RAND (trimmed) |
QFT | 60 | |0..0> | 16777216 | 13.835500 | 3500 | QFT (0 input) |
QFT | 60 | |Wn> | 16777216 | 106.931000 | 24800 | QFT (W input) |
BENCH | 62 | |0..0> | 256 | 1154.630000 | 393440 | RAND (trimmed) |
QFT | 62 | |0..0> | 16777216 | 15.082800 | 3800 | QFT (0 input) |
QFT | 62 | |Wn> | 16777216 | 136.141000 | 33720 | QFT (W input) |
BENCH | 64 | |0..0> | 256 | 1274.880000 | 447260 | RAND (trimmed) |
QFT | 64 | |0..0> | 16777216 | 17.094300 | 4420 | QFT (0 input) |
QFT | 64 | |Wn> | 16777216 | 165.921000 | 40360 | QFT (W input) |
BENCH | 66 | |0..0> | 256 | 1358.580000 | 454060 | RAND (trimmed) |
QFT | 66 | |0..0> | 16777216 | 18.182900 | 4430 | QFT (0 input) |
QFT | 66 | |Wn> | 16777216 | 184.707000 | 44620 | QFT (W input) |
BENCH | 68 | |0..0> | 256 | 1474.950000 | 511090 | RAND (trimmed) |
QFT | 68 | |0..0> | 16777216 | 20.357600 | 5090 | QFT (0 input) |
QFT | 68 | |Wn> | 16777216 | 231.379000 | 56220 | QFT (W input) |
BENCH | 70 | |0..0> | 256 | 1579.340000 | 506360 | RAND (trimmed) |
QFT | 70 | |0..0> | 16777216 | 20.230600 | 4930 | QFT (0 input) |
QFT | 70 | |Wn> | 16777216 | 288.459000 | 69720 | QFT (W input) |
QFT | 72 | |0..0> | 16777216 | 24.923100 | 6400 | QFT (0 input) |
QFT | 72 | |Wn> | 16777216 | 346.768000 | 86960 | QFT (W input) |
QFT | 74 | |0..0> | 16777216 | 26.342600 | 6520 | QFT (0 input) |
QFT | 74 | |Wn> | 16777216 | 456.392000 | 109730 | QFT (W input) |
QFT | 76 | |0..0> | 16777216 | 27.884200 | 6910 | QFT (0 input) |
QFT | 76 | |Wn> | 16777216 | 596.460000 | 150470 | QFT (W input) |
QFT | 78 | |0..0> | 16777216 | 30.179600 | 7650 | QFT (0 input) |
QFT | 78 | |Wn> | 16777216 | 627.008000 | 151430 | QFT (W input) |
QFT | 80 | |0..0> | 16777216 | 32.779900 | 8300 | QFT (0 input) |
QFT | 80 | |Wn> | 16777216 | 817.609000 | 198300 | QFT (W input) |
QFT | 82 | |0..0> | 16777216 | 35.465700 | 8920 | QFT (0 input) |
QFT | 82 | |Wn> | 16777216 | 862.174000 | 194590 | QFT (W input) |
QFT | 84 | |0..0> | 16777216 | 37.657800 | 9220 | QFT (0 input) |
QFT | 84 | |Wn> | 16777216 | 952.616000 | 230260 | QFT (W input) |
QFT | 86 | |0..0> | 16777216 | 41.559900 | 10110 | QFT (0 input) |
QFT | 86 | |Wn> | 16777216 | 1050.530000 | 250290 | QFT (W input) |
QFT | 88 | |0..0> | 16777216 | 40.099200 | 10470 | QFT (0 input) |
QFT | 88 | |Wn> | 16777216 | 1197.190000 | 287880 | QFT (W input) |
QFT | 90 | |0..0> | 16777216 | 46.652300 | 11610 | QFT (0 input) |
QFT | 90 | |Wn> | 16777216 | 1371.380000 | 324020 | QFT (W input) |
Weak scaling was measured using the minimum number of ARCHER2 nodes required to fit the problem, starting from 1 node for 32 qubits and doubling for every additional qubit. Since parallel efficiency is poor, running on this minimum node count is clearly the most efficient approach. One exception is that 33 qubits can also fit on 1 node, as no MPI sendrecv buffer is needed there.
QFT performed better than RAND due to its prevalent diagonal gates. As predicted, there is a significant energy consumption gap between the high and medium CPU clock frequencies.
The clock frequency impact is shown for the same scaling as in the weak scaling figure, with runtime and energy displayed relative to the high frequency. A new highm1 frequency setting was added, which lies above medium.
It is clear that the runtime penalty of decreasing the frequency is smaller than the gain in energy efficiency, with the medium setting being optimal. Reducing the frequency is therefore a viable strategy for greener simulations, adding only a few seconds of runtime cost.
Tensor networks allow an efficient state vector representation via a Matrix Product State (MPS). Instead of storing \(2^n\) amplitudes, each qubit is represented by a site, which is connected to others with contractable bonds of dimensions that correspond to the entanglement of the state:
The platform used for the experiments is ITensor. It does not support any parallelism for general tensor networks, but even under those limitations it can outperform QuEST for some circuits.
QFT is a circuit that doesn’t induce much entanglement, so it can be simulated very fast. It was fed with a trivial \(\vert 00...0 \rangle\) state, and a lightly entangled \(\vert W \rangle\) state defined as:
\[\vert W \rangle = \frac{1}{\sqrt n} (\vert 100...0 \rangle + \vert 010...0 \rangle + ... + \vert 000...1 \rangle)\]
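A quick numerical check (an illustrative sketch) shows why \(\vert W \rangle\) counts as lightly entangled: across every bipartition the left part either contains the single excitation or it does not, so the Schmidt rank is 2 and an MPS with bond dimension 2 stores the state exactly:

```python
import numpy as np

n = 10
w = np.zeros(2**n)
w[[2**k for k in range(n)]] = 1 / np.sqrt(n)  # one excitation per term

for cut in range(1, n):
    mat = w.reshape(2**cut, 2**(n - cut))
    s = np.linalg.svd(mat, compute_uv=False)
    # Exactly two non-zero Schmidt values across every cut.
    assert np.sum(s > 1e-12) == 2
print("bond dimension 2 suffices at every cut")
```

This is why the QFT on the \(\vert W \rangle\) input stays cheap for ITensor even at large qubit counts.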
RAND is highly entangling, which makes full-state simulation difficult. Promising results were achieved by truncating the bond dimensions; in the figure above, a maximum bond dimension of 256 was used for the trimmed state. However, its accuracy quickly degrades as the number of qubits grows.
So far, we have only tested the Schrödinger and MPS algorithms thoroughly. We used QuEST for state vector simulation, as on an HPC system it proved much faster than Qiskit. It offers parallelisation across \(2^p\) processes, and ARCHER2 was able to simulate up to 43 qubits on 2,048 nodes. Interestingly, downclocking the CPU can save much energy with little time penalty. We are also writing our own simulator to remove the \(2^p\)-process restriction, which we found leaves some nodes underused. [6]
Tensor networks were tested in ITensor, which offers no message-passing parallelism, but even without it can be much faster than the Schrödinger method. As long as the entanglement in the system is low and keeps being truncated (which acts similarly to noise), the runtime shows polynomial complexity. However, if we want a noiseless statevector, it performs much worse than the Schrödinger method. [7]
The state vector approach is stable regardless of the entanglement. Due to its high communication costs, it is most economical when run on the minimum number of nodes that fits the problem, with the CPU clock frequency reduced to the medium setting.
In contrast, the runtime, and thus the energy consumption, of tensor networks depends strongly on the entanglement. If it is low, ITensor can drastically outperform QuEST despite featuring no parallelism. However, for highly entangled states the full simulation is very slow, although it can be approximated by truncating the bond dimensions.
In the future, tensor network simulations should be parallelised with OpenMP and MPI. They could then be directly compared against state vector methods to determine whether and when they are advantageous, especially in the case of an approximate state.