Faster ways to compute sum of an array
(Code available at harningle/useful-scripts)
When I wrote the code for synthetic control, I found an interesting phenomenon: `(arr ** 2).sum()` is sometimes slower than `arr.T @ arr`. They both calculate the sum of squares, but presumably the underlying implementations are different. Without going into the details, I run some benchmarks here to see what the fastest way to compute such sums is.
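To make the comparison concrete, here is a minimal sketch of the kind of timing I have in mind (the array size and the number of repetitions are arbitrary choices for illustration):

```python
import timeit

import numpy as np

rng = np.random.default_rng(42)
arr = rng.random(1_000_000)  # arbitrary size, just for illustration

# Two ways to compute the same sum of squares
t_sum = timeit.timeit(lambda: (arr ** 2).sum(), number=100)
t_dot = timeit.timeit(lambda: arr.T @ arr, number=100)
print(f"(arr ** 2).sum(): {t_sum:.4f}s")
print(f"arr.T @ arr:      {t_dot:.4f}s")
```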
Simple sum
Let’s begin with something easier: add up all elements in an array. Say $arr$ is an array of $n$ numbers, and our target is simply $arr_0 + arr_1 + \cdots + arr_{n - 1}$. In addition to the trivial `np.sum()`, this addition can be viewed as a dot product:

$$arr_0 + arr_1 + \cdots + arr_{n - 1} = \begin{pmatrix} arr_0 & arr_1 & \cdots & arr_{n - 1} \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}.$$

Now the question is: is this funny dot product `np.dot` faster than `np.sum`? Theoretically, the performance can differ: one is pure addition, while the other is matrix/vector multiplication.[^1] The figure below shows the speed of different ways to compute the sum of arrays of different sizes. Python's native `sum` is quite OK when the array is small, as it has none of NumPy's overhead.
Between `np.sum` and `np.dot` there is a persistent gap, with `np.dot` always being a bit faster.[^2] The real killer is CuPy, which utilises the GPU: it does the sum in almost constant time!
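For concreteness, here is a sketch of the contenders above (variable names are mine; the CuPy part assumes a CUDA-capable GPU with `cupy` installed):

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random(1_000_000)
ones = np.ones(arr.shape[0])  # the vector of ones for the dot product trick

total_native = sum(arr)        # Python built-in sum
total_np = np.sum(arr)         # NumPy's pairwise summation
total_dot = np.dot(arr, ones)  # BLAS dot product against a vector of ones
assert np.isclose(total_np, total_dot)

# The GPU version: same result, computed by CuPy on the device
import cupy as cp
total_gpu = float(cp.sum(cp.asarray(arr)))  # copy to GPU, reduce, copy back
```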
Sum of squares
Now back to the original question: to compute the sum of squares, is `np.sum(arr ** 2)` faster, or `arr.T @ arr`? After seeing the results above, we expect the matrix multiplication to be faster, and our results below do confirm this. In fact, there is a third way to compute this: the sum of squares is just the squared $\ell^2$ norm, so we can use the linear algebra method `np.linalg.norm` as well. It’s a bit weird that for small arrays, `np.linalg.norm` performs like `np.sum`, and when the array gets larger, its speed is closer to `np.dot`’s. Not sure if it’s just noise or some fundamental difference. And again, CuPy is sooooo fast and does this in almost constant time.
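As a quick sketch, here are the three CPU-side ways of getting the sum of squares (note that `np.linalg.norm` returns the norm itself, so it has to be squared):

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random(1_000_000)

ss_sum = np.sum(arr ** 2)           # square elementwise, then pairwise sum
ss_dot = arr.T @ arr                # BLAS dot product of arr with itself
ss_norm = np.linalg.norm(arr) ** 2  # l2 norm, squared back to sum of squares
assert np.isclose(ss_sum, ss_dot) and np.isclose(ss_sum, ss_norm)
```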
[^1]: A very good explanation from Stack Overflow: “`np.dot` delegates to a BLAS vector-vector multiply, while `np.sum` uses a pairwise summation routine, switching over to an 8x unrolled summation loop at a block size of 128 elements.” I also noticed that OpenBLAS NumPy and MKL NumPy have some performance difference; I looked into that in detail in another post.

[^2]: In order to do the dot product, we need to create a column vector of ones. This creation time is not included here. If it were included, then of course `np.dot` would be slower than `np.sum`, as it needs to navigate through the memory to create this column vector.