High Performance Python?

Why do scientific computing with Python when you have to pass everything through the interpreter?

Firstly, Python remains incredibly popular, and is gaining ground as the most common first language of instruction in universities and schools [2]. Even my own undergrad introduction to computing was in Python, back in 2014.

So we have a generation of technical people (scientists, engineers, mathematicians, economists etc) who are Python natives. Having been exposed to it from a young age, and are comfortable with it. This has undoubtedly lead to the explosion in open-source for Python.

This is both good and bad, Python has many idiosyncrasies and design features which are now learned wrote by many people who program computers to solve specific tasks. However, it also means that there are huge spectrum of things you can do with just Python, from machine learning, to deploying distributed applications on the internet.

Enter Numba. Numba is a just-in-time [JIT] compiler for Python, and optimises computations on ndarrays, the container type used in the Numpy library. Specifically, Numba can detect and convert layers of indirection instructions into load/store instructions from registers. This makes the runtime code a lot simpler when running computations on a lot of arrays. This also means that there is a compile time hit to performance, however as compilation is ‘just in time’, Numba only compiles if and only if your accelerated function is used at runtime.

Most usefully, Numba integrates seamlessly with CuPy for GPU development and offers CPU multithreading support through a prange statement, that’s similar in behaviour to OpenMP’s parallel for loops. Furthermore, Numba automatically SIMD vectorizes your code, specific to your hardware. So Numba can be used, in theory, to develop heterogenous applications from a single Python source. Making it easier than ever for domain specialists to achieve high performance, and take advantage of the hardware of their computer.

Example

How does it work? It’s installed via a package manager, and users interface to it via a simple decorator.

import numba

@numba.njit
def accumulate(a):

	c = 0
	for i in range(len(a)):
		c += a[i]
	return c

This function (excluding compilation time) is nearly 400 times faster (240 ns vs 92 $\mu$s) on my i7 CPU when run with a=np.arange(1000).

Pitfalls

This sounds great, but but one must remember that Numba is not Python. To achieve performance, one has to ensure that your code can compile to something that is vectorisable by Numba. If not, Numba could potentially be slower than ordinary Python. Consider the following example, which is slightly contrived,

@numba.njit
def hasher(a):
    for i in range(len(a)):
        hash(a)
    return hash(a)

Running without Numba takes 277 ns, compared to 1.4$\mu$s with Numba, where a='foo'. This is an example of an operation that is valid Python, however outside of Numba’s declared remit. You see similar behaviour if you try and do a lot of data manipulation/allocation.

Over the past year or so, I’ve been developing PyExaFMM, a Python library for simulating the Fast Multipole Method using Numba as a backend language. I’m now in the middle of writing a paper about my experiences of Numba development, and its pitfalls in developing more complex software. Overall it’s a great tool, and I’d highly recommend it for developers looking for scientific scale performance, despite it’s pitfalls. It has it’s own learning curve, and is another great example of Python open source.

Numba for Hackers

High Performance Python?

Example

Pitfalls