Speed comparison among numpy, cython, numba and tensorflow 2.0
Recently I have been working on speeding up some code in pymatgen for finding the atomic neighbors within a cutoff radius. While searching online I found that cython is a rather powerful tool for accelerating python loops, and decided to give it a try. A common point of comparison for cython is numba, and I have heard many good things about it. A less common competitor is the recently released tensorflow 2.0. In fact, back in the tensorflow 1.x era I did some simple comparisons and found that it was indeed faster than numpy. The new tensorflow 2.0 is touted to be 3x faster than tensorflow 1.x, which makes me wonder how much faster it would be on some simple computing tasks.
Function decorator to record time
I like to do simple things myself so that I know exactly what happens in the code, so I am writing a timeit decorator instead of using the built-in timeit module.
```python
from time import time
import functools


def timeit(n=10):
    """
    Decorator to run function n times and print out the total time elapsed.
    """
    def dec(func):
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            t0 = time()
            for i in range(n):
                func(*args, **kwargs)
            print("%s iterated %d times\nTime elapsed %.3fs\n" % (
                func.__name__, n, time() - t0))
        return wrapped
    return dec
```
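As a quick usage check (waste_time is just a hypothetical throwaway function, not part of the benchmark below):

```python
@timeit(n=3)
def waste_time():
    sum(i * i for i in range(100000))

waste_time()
# prints "waste_time iterated 3 times" followed by the total elapsed time
```

Note that the wrapper discards the wrapped function's return value, which is fine for pure benchmarking.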
Computing functions using different methods
Here I am computing

\[\mathrm{matrix}[i, j] = i^2 + j^2\]

for a matrix of size m × n.
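For example, with m = n = 3 the result is

\[\begin{pmatrix} 0 & 1 & 4 \\ 1 & 2 & 5 \\ 4 & 5 & 8 \end{pmatrix}\]

since entry (i, j) is just i^2 + j^2 with zero-based indices.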
```python
# import numba, tensorflow and numpy, load cython
import numba
import tensorflow as tf
import numpy as np

%load_ext cython
```
```python
@tf.function
def compute_tf(m, n):
    print('Tracing ', m, n)
    x1 = tf.range(0, m, 1) ** 2  # 0, 1, ..., m-1, matching the numpy version below
    x2 = tf.range(0, n, 1) ** 2
    return x1[:, None] + x2[None, :]

compute_tf(tf.constant(1), tf.constant(1))  # trace once
```
Tracing Tensor("m:0", shape=(), dtype=int32) Tensor("n:0", shape=(), dtype=int32) <tf.Tensor: id=261, shape=(0, 0), dtype=int32, numpy=array(, shape=(0, 0), dtype=int32)>
I used the tf.function decorator to define the graph, and avoided repeated tracing of the graph by using tf.constant as inputs and performing the initial graph tracing once up front. You will see that running this function again will not invoke the print statement, because the Python body only executes while the graph is being traced.
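The retracing behavior is easy to observe. Each distinct Python integer value is baked into the graph as a constant and forces a fresh trace, while scalar tensor inputs reuse the graph already traced above:

```python
compute_tf(tf.constant(5), tf.constant(5))  # no "Tracing" printed: graph reused
compute_tf(5, 5)  # Python ints are treated as constants, so this traces again
compute_tf(5, 6)  # ...and retraces once more for the new values
```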
```python
def compute_numpy(m, n):
    x1 = np.linspace(0., m-1, m) ** 2
    x2 = np.linspace(0., n-1, n) ** 2
    return x1[:, None] + x2[None, :]
```
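The [:, None] and [None, :] indexing is numpy broadcasting: each inserts a length-1 axis, so a shape-(m,) vector and a shape-(n,) vector combine into a shape-(m, n) matrix. A small illustration:

```python
a = np.arange(3.) ** 2          # shape (3,): [0., 1., 4.]
b = np.arange(2.) ** 2          # shape (2,): [0., 1.]
print(a[:, None] + b[None, :])  # shapes (3, 1) + (1, 2) broadcast to (3, 2)
# [[0. 1.]
#  [1. 2.]
#  [4. 5.]]
```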
```python
@numba.njit
def compute_numba(m, n):
    x = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            x[i, j] = i**2 + j**2
    return x

compute_numba(1, 1)  # JIT compile first
```
```python
@numba.njit(parallel=True)
def compute_numba_parallel(m, n):
    x = np.empty((m, n))
    for i in numba.prange(m):
        for j in numba.prange(n):
            x[i, j] = i**2 + j**2
    return x

compute_numba_parallel(1, 1)  # JIT compile first
```
The numpy and numba implementations are almost the same, and numba is really handy in terms of turning on parallel computation: decorating with @numba.njit(parallel=True) and swapping range for numba.prange is all it takes. (Only the outermost prange loop is actually parallelized; an inner prange falls back to an ordinary range.)
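If you want to check what numba actually parallelized, the compiled function exposes a diagnostics report (assuming a reasonably recent numba version):

```python
# Prints a summary of which loops were fused and run in parallel
compute_numba_parallel.parallel_diagnostics(level=1)
```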
```python
%%cython
cimport cython
import numpy as np
cimport numpy as np


@cython.boundscheck(False)
@cython.wraparound(False)
def compute_cython(int m, int n):
    cdef long [:, ::1] x = np.empty((m, n), dtype=int)
    cdef int i, j
    for i in range(m):
        for j in range(n):
            x[i, j] = i*i + j*j
    return x
```
cython needs a bit more work. I am delegating the memory management to numpy here and using a typed memoryview x; the loop body is basically C. Note that cython can also turn on parallel computation like numba, via cython.parallel.prange. However, that requires openmp, which does not ship with the clang compiler on macOS, so I am not testing the parallel version here.
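For reference, here is roughly what the parallel version would look like with a compiler that supports openmp (e.g. gcc). This is an untested sketch; the --compile-args/--link-args flags are how the %%cython magic passes options through to the compiler:

```python
%%cython --compile-args=-fopenmp --link-args=-fopenmp
cimport cython
import numpy as np
from cython.parallel import prange


@cython.boundscheck(False)
@cython.wraparound(False)
def compute_cython_parallel(int m, int n):
    cdef long [:, ::1] x = np.empty((m, n), dtype=int)
    cdef int i, j
    # the outer loop is distributed across openmp threads with the GIL released
    for i in prange(m, nogil=True):
        for j in range(n):
            x[i, j] = i*i + j*j
    return x
```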
```python
m = 2000
n = 10000
n_loop = 10

timeit(n=n_loop)(compute_numpy)(m, n)
timeit(n=n_loop)(compute_numba)(m, n)
timeit(n=n_loop)(compute_numba_parallel)(m, n)
timeit(n=n_loop)(compute_cython)(m, n)
timeit(n=n_loop)(compute_tf)(tf.constant(m), tf.constant(n))
```
```
compute_numpy iterated 10 times
Time elapsed 0.971s

compute_numba iterated 10 times
Time elapsed 1.110s

compute_numba_parallel iterated 10 times
Time elapsed 0.651s

compute_cython iterated 10 times
Time elapsed 1.098s

compute_tf iterated 10 times
Time elapsed 0.190s
```
Tensorflow 2.0 is amazing. It is about 5x faster than numpy on this task, and more than 3x faster than even the parallel numba version.