Speed comparison among numpy, cython, numba and tensorflow 2.0
Recently I have been working on speeding up some code in pymatgen for finding the atomic neighbors within a cutoff radius. I was searching online and found that cython is a rather powerful tool for accelerating python loops, so I decided to give it a try. A common alternative to cython is numba, and I have heard many good things about it. A less common competitor is the recently released tensorflow 2.0. In fact, back in the tensorflow 1.x era, I did some simple comparisons and found that it was actually faster than numpy. The new tensorflow 2.0 is claimed to be 3x faster than tensorflow 1.x, which makes me wonder how much faster tensorflow 2.0 would be for some simple computing tasks.
Function decorator to record time
I like to do simple things myself so that I know exactly what happens in the code, so I am writing a timeit decorator instead of using the timeit module.
from time import time
import functools

def timeit(n=10):
    """
    Decorator to run a function n times and print out the total time elapsed.
    """
    def dec(func):
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            t0 = time()
            for i in range(n):
                func(*args, **kwargs)
            print("%s iterated %d times\nTime elapsed %.3fs\n" % (
                func.__name__, n, time() - t0))
        return wrapped
    return dec
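As a quick check, it can be used with the usual @ syntax, or called directly on an existing function as I do in the Results section below (dummy_sum here is just a hypothetical toy workload):

@timeit(n=3)
def dummy_sum():
    # hypothetical toy workload, only used to exercise the decorator
    return sum(i * i for i in range(1_000_000))

dummy_sum()                # prints "dummy_sum iterated 3 times ..."
timeit(n=3)(len)("hello")  # or wrap an existing callable directly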
Computing functions using different methods
Here I am computing
\[matrix[i, j] = i^2 + j^2\]
for a matrix of size [m, n].
# import numba, tensorflow and numpy, load cython
import numba
import tensorflow as tf
import numpy as np
%load_ext cython
@tf.function
def compute_tf(m, n):
    print('Tracing ', m, n)
    x1 = tf.range(0, m, 1) ** 2
    x2 = tf.range(0, n, 1) ** 2
    return x1[:, None] + x2[None, :]

compute_tf(tf.constant(1), tf.constant(1))  # trace once
Tracing Tensor("m:0", shape=(), dtype=int32) Tensor("n:0", shape=(), dtype=int32)
<tf.Tensor: id=261, shape=(1, 1), dtype=int32, numpy=array([[0]], dtype=int32)>
I used the tf.function decorator to define the graph, and avoided repeatedly retracing it by using tf.constant inputs and performing the initial graph tracing up front. You will see that subsequent calls to this function do not invoke the print statement; it is only traced once.
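As a quick sanity check (using the compute_tf defined above), calling the traced function again with tf.constant inputs of the same dtype and shape reuses the graph, while plain Python ints are baked into the graph as constants, so new values would trigger another trace:

compute_tf(tf.constant(5), tf.constant(5))  # reuses the traced graph, nothing printed
compute_tf(5, 5)                            # Python int arguments trigger a new trace, "Tracing" prints again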
def compute_numpy(m, n):
    x1 = np.linspace(0., m-1, m) ** 2
    x2 = np.linspace(0., n-1, n) ** 2
    return x1[:, None] + x2[None, :]
@numba.njit
def compute_numba(m, n):
    x = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            x[i, j] = i**2 + j**2
    return x

compute_numba(1, 1)  # JIT compile first
@numba.njit(parallel=True)
def compute_numba_parallel(m, n):
    x = np.empty((m, n))
    for i in numba.prange(m):
        for j in numba.prange(n):
            x[i, j] = i**2 + j**2
    return x

compute_numba_parallel(1, 1)  # JIT compile first
array([[0.]])
The numpy and numba implementations are almost the same. numba is really handy for turning on parallel computation.
%%cython
cimport cython
import numpy as np
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def compute_cython(int m, int n):
    cdef long [:, ::1] x = np.empty((m, n), dtype=int)
    cdef int i, j
    for i in range(m):
        for j in range(n):
            x[i, j] = i*i + j*j
    return x
cython needs more work. I am delegating the memory management to numpy here and using a memoryview for x; basically it is like writing C. Note that cython can also turn on parallel computation like numba by using cython.parallel.prange. However, that requires openmp, which does not ship with the clang compiler on macOS, so I am not benchmarking the parallel version here (a rough sketch is shown below for reference).
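For reference only, a parallel version would look roughly like the following sketch. It is not benchmarked here, and it assumes a compiler with OpenMP support (e.g. gcc), with -fopenmp passed through the %%cython magic's compile and link arguments:

%%cython --compile-args=-fopenmp --link-args=-fopenmp
cimport cython
from cython.parallel import prange
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def compute_cython_parallel(int m, int n):
    cdef long [:, ::1] x = np.empty((m, n), dtype=int)
    cdef int i, j
    # the outer loop runs across threads; the body releases the GIL
    for i in prange(m, nogil=True):
        for j in range(n):
            x[i, j] = i*i + j*j
    return x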
Results
m = 2000
n = 10000
n_loop = 10
timeit(n=n_loop)(compute_numpy)(m, n)
timeit(n=n_loop)(compute_numba)(m, n)
timeit(n=n_loop)(compute_numba_parallel)(m, n)
timeit(n=n_loop)(compute_cython)(m, n)
timeit(n=n_loop)(compute_tf)(tf.constant(m), tf.constant(n))
compute_numpy iterated 10 times
Time elapsed 0.971s
compute_numba iterated 10 times
Time elapsed 1.110s
compute_numba_parallel iterated 10 times
Time elapsed 0.651s
compute_cython iterated 10 times
Time elapsed 1.098s
compute_tf iterated 10 times
Time elapsed 0.190s
Conclusion
Tensorflow 2.0 is amazing. For this simple task it is about 5x faster than the single-threaded numpy, numba and cython versions, and still more than 3x faster than parallel numba.