Faster ways to compute sum of an array
(Code available at harningle/useful-scripts)
When I wrote the code for synthetic control, I found an interesting phenomenon: `(arr ** 2).sum()` is sometimes slower than `arr.T @ arr`. They both calculate the sum of squares, but presumably the underlying implementations are different. Without going into the details, I run some benchmarks here to see what the fastest way to compute such sums is.
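To make the comparison concrete, here is a minimal sketch of the kind of timing I have in mind (the array size and the number of repetitions are arbitrary choices for illustration):

```python
import timeit

import numpy as np

rng = np.random.default_rng(42)
arr = rng.random(1_000_000)  # arbitrary size, just for illustration

# Two ways to compute the same sum of squares
t_sum = timeit.timeit(lambda: (arr ** 2).sum(), number=100)
t_dot = timeit.timeit(lambda: arr.T @ arr, number=100)
print(f"(arr ** 2).sum(): {t_sum:.4f}s")
print(f"arr.T @ arr:      {t_dot:.4f}s")
```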
Simple sum
Let’s begin with something easier: add up all elements in an array. Say $arr$ is an array of $n$ numbers, and our target is simply $arr_0 + arr_1 + \cdots + arr_{n - 1}$. In addition to the trivial `np.sum()`, this addition can be viewed as a dot product:

$$arr_0 + arr_1 + \cdots + arr_{n - 1} = \begin{pmatrix} arr_0 & arr_1 & \cdots & arr_{n - 1} \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}.$$

Now the question is: is this funny dot product `np.dot` faster than `np.sum`? Theoretically, the performance can differ: one is pure addition, while the other is matrix/vector multiplication.[^1] The figure below shows the speed of different ways to compute the sum of arrays of different sizes. Python's native `sum` is quite OK when the array is small, as it has none of NumPy's overhead.
Between `np.sum` and `np.dot` there is a persistent gap, with `np.dot` always being a bit faster.[^2] The real killer is CuPy, which utilises the GPU: it does the sum in almost constant time!
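For concreteness, here is a sketch of the contenders above (variable names are mine; the CuPy part assumes a CUDA-capable GPU with `cupy` installed):

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random(1_000_000)
ones = np.ones(arr.shape[0])  # the vector of ones for the dot product trick

total_native = sum(arr)        # Python built-in sum
total_np = np.sum(arr)         # NumPy's pairwise summation
total_dot = np.dot(arr, ones)  # BLAS dot product against a vector of ones
assert np.isclose(total_np, total_dot)

# The GPU version: same result, computed by CuPy on the device
import cupy as cp
total_gpu = float(cp.sum(cp.asarray(arr)))  # copy to GPU, reduce, copy back
```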
Sum of squares
Now back to the original question: to compute the sum of squares, is `np.sum(arr ** 2)` faster, or `arr.T @ arr`? After seeing the results above, we expect the matrix multiplication to be faster, and our results below do confirm this. In fact, there is a third way to compute this: the sum of squares is just the squared $\ell^2$ norm, so we can use the linear algebra method `np.linalg.norm` as well. It’s a bit weird that for small arrays, `np.linalg.norm` performs like `np.sum`, and when the array gets larger, its speed is closer to `np.dot`’s. Not sure if it’s just noise or some fundamental difference. And again, CuPy is sooooo fast and does this in almost constant time.
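As a quick sketch, here are the three CPU-side ways of getting the sum of squares (note that `np.linalg.norm` returns the norm itself, so it has to be squared):

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random(1_000_000)

ss_sum = np.sum(arr ** 2)           # square elementwise, then pairwise sum
ss_dot = arr.T @ arr                # BLAS dot product of arr with itself
ss_norm = np.linalg.norm(arr) ** 2  # l2 norm, squared back to sum of squares
assert np.isclose(ss_sum, ss_dot) and np.isclose(ss_sum, ss_norm)
```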
[^1]: A very good explanation from Stack Overflow: “`np.dot` delegates to a BLAS vector-vector multiply, while `np.sum` uses a pairwise summation routine, switching over to an 8x unrolled summation loop at a block size of 128 elements.” I also noticed that OpenBLAS NumPy and MKL NumPy have some performance difference; I looked into that in detail in another post.

[^2]: In order to do the dot product, we need to create a column vector of ones. This creation time is not included here. If it were included, then of course `np.dot` would be slower than `np.sum`, as it needs to navigate through the memory to create this column vector.