Speed comparison among numpy, cython, numba and tensorflow 2.0

Recently I have been working on speeding up some code in pymatgen for finding the atomic neighbors within a cutoff radius. While searching online I found that cython is a rather powerful tool for accelerating python loops, and decided to give it a try.

A common alternative to cython is numba, and I have heard many good things about it. A less common competitor is the recently released tensorflow 2.0. Back in the tensorflow 1.x era I did some simple comparisons and found that it was in fact faster than numpy. Tensorflow 2.0 is advertised to be 3x faster than tensorflow 1.x, which made me wonder how much faster it would be on some simple computing tasks.

Function decorator to record time

I like to do simple things myself so that I know exactly what happens in the code, so I am writing my own timeit decorator instead of using the timeit module.

from time import time
import functools

def timeit(n=10):
    """
    Decorator to run function n times and print out the total time elapsed.
    """
    def dec(func):
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            t0 = time()
            for i in range(n):
                func(*args, **kwargs)
            print("%s iterated %d times\nTime elapsed %.3fs\n" % (
                func.__name__, n, time() - t0))
        return wrapped
    return dec
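
As a quick illustration of how it is used (with a hypothetical add_lists function), the decorator can either be applied with @timeit(...) at definition time or wrapped around an existing function on the fly; the latter style is what I use in the Results section below.

@timeit(n=5)
def add_lists(a, b):  # hypothetical example function
    return [x + y for x, y in zip(a, b)]

add_lists(range(1000), range(1000))  # prints "add_lists iterated 5 times" and the elapsed time

# the same decorator applied on the fly, as in the Results section
timeit(n=5)(sum)(range(10**6))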

Computing functions using different methods

Here I am computing

\[\mathrm{matrix}[i, j] = i^2 + j^2\]

for a matrix of size [m, n].

# import numba, tensorflow and numpy, load cython
import numba
import tensorflow as tf
import numpy as np

%load_ext cython

@tf.function
def compute_tf(m, n):
    print('Tracing ',  m,  n)
    x1 = tf.range(0, m-1, 1) ** 2
    x2 = tf.range(0, n-1, 1) ** 2
    return x1[:, None] + x2[None, :]

compute_tf(tf.constant(1), tf.constant(1)) # trace once
Tracing  Tensor("m:0", shape=(), dtype=int32) Tensor("n:0", shape=(), dtype=int32)
<tf.Tensor: id=261, shape=(0, 0), dtype=int32, numpy=array([], shape=(0, 0), dtype=int32)>

I used the tf.function decorator to define the graph, and avoided repeatedly retracing it by passing tf.constant values as inputs and performing an initial trace above. You will see that subsequent calls to this function do not invoke the print statement: the graph is only traced once.
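
To illustrate why the tf.constant inputs matter: tf.function retraces for every new Python scalar it sees, but reuses the traced graph for tensors of the same dtype and shape. A small sketch (not part of the benchmark):

# compute_tf(2, 3)    # Python ints: would print "Tracing  2 3", i.e. a new trace
# compute_tf(4, 5)    # and trace yet again for the new values

compute_tf(tf.constant(2), tf.constant(3))      # reuses the traced graph, nothing printed
compute_tf(tf.constant(400), tf.constant(500))  # still no retracing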

def compute_numpy(m, n):
    x1 = np.linspace(0., m-1, m) ** 2
    x2 = np.linspace(0., n-1, n) ** 2
    return x1[:, None] + x2[None, :]

@numba.njit
def compute_numba(m, n):
    x = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            x[i, j] =  i**2 + j**2
    return x

compute_numba(1, 1) # JIT compile first

@numba.njit(parallel=True)
def compute_numba_parallel(m, n):
    x = np.empty((m, n))
    for i in numba.prange(m):
        for j in numba.prange(n):
            x[i, j] =  i**2 + j**2
    return x

compute_numba_parallel(1, 1) # JIT compile first
array([[0.]])

The numpy and numba versions take almost the same effort to write: the numba function is just plain Python loops with an njit decorator. numba is also really handy for turning on parallel computation; the parallel=True option together with numba.prange is all it takes.
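
As a quick sanity check (not timed, and the small sizes here are arbitrary), the versions defined so far agree on the result:

a = compute_numpy(50, 60)
assert np.allclose(a, compute_numba(50, 60))
assert np.allclose(a, compute_numba_parallel(50, 60))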

%%cython
cimport cython
import numpy as np
cimport numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def compute_cython(int m, int n):
    cdef long [:, ::1] x = np.empty((m, n), dtype=int)
    cdef int i, j
    for i in range(m):
        for j in range(n):
            x[i, j] = i*i +j*j
    return x

cython needs more work. I am delegating the memory management to numpy here and using a typed memoryview x; basically it is like writing C. Note that cython can also turn on parallel loops like numba, via cython.parallel.prange. However, that requires OpenMP, which does not ship with the clang compiler on macOS, so I am not testing a parallel cython version here (a rough sketch of what it would look like is below).
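
For reference, a parallel cython version would look roughly like the sketch below. It is untested here (it needs a compiler with OpenMP support, e.g. gcc, hence the -fopenmp flags), and the cell magic arguments and the compute_cython_parallel name are my own additions rather than code from the benchmark above.

%%cython --compile-args=-fopenmp --link-args=-fopenmp
# untested sketch: requires a compiler with OpenMP support
cimport cython
from cython.parallel import prange
import numpy as np
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def compute_cython_parallel(int m, int n):
    # same kernel as compute_cython, but the outer loop releases the GIL
    # and is split across OpenMP threads
    cdef long [:, ::1] x = np.empty((m, n), dtype=int)
    cdef int i, j
    for i in prange(m, nogil=True):
        for j in range(n):
            x[i, j] = i*i + j*j
    return x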

Results

m = 2000
n = 10000
n_loop = 10

timeit(n=n_loop)(compute_numpy)(m, n)
timeit(n=n_loop)(compute_numba)(m, n)
timeit(n=n_loop)(compute_numba_parallel)(m, n)
timeit(n=n_loop)(compute_cython)(m, n)
timeit(n=n_loop)(compute_tf)(tf.constant(m), tf.constant(n))
compute_numpy iterated 10 times
Time elapsed 0.971s

compute_numba iterated 10 times
Time elapsed 1.110s

compute_numba_parallel iterated 10 times
Time elapsed 0.651s

compute_cython iterated 10 times
Time elapsed 1.098s

compute_tf iterated 10 times
Time elapsed 0.190s

Conclusion

Tensorflow 2.0 is amazing. For this simple task, the traced tf.function version runs roughly 3-6x faster than the numpy, numba and cython versions on this machine.