Data Structures for Humans

23 Dec 2018

Just as there are atomic algorithms, there are also atomic data structures. These essentially amount to different objects to store data in, each object following a conveniently designed protocol. In this post I’m going to go through the most common ones that every coder should just sort of know. Namely: linked lists, hash tables, stacks, queues and rooted trees.

This may sound like a lot for a single post, but as we’ll see, a lot of the protocols boil down to a single use case: storing data, and accessing it conveniently in order to implement different algorithms. Implementation of algorithms is key here; the design of data structures is intrinsically interesting from a maths perspective, but it is heavily influenced by the problems you can solve by using them.

Similar to the previous post on algorithms, I’m going to go through the above data structures one by one, and write a simple Python implementation of each protocol. Furthermore, as a Python programmer I’m also going to write a little bit about how Python implements (some of) these data structures as language primitives (i.e. for free!). These primitives don’t necessarily adhere strictly to the protocols as described in CS textbooks, but they do offer some convenient extras that are special to the language. My aim here is simply to make visible all the moving parts.

Stacks and Queues

Stacks

Stacks and queues are well named, and behave as you would expect such objects to. Stacks operate under ‘Last In, First Out’ (LIFO), like a stack of plates: you get to the bottom plate by first removing the plates on top of it. Queues operate under ‘First In, First Out’ (FIFO), much like standing in a (well functioning) queue.

Below I’ve implemented a stack object using the Python list. Stacks implement two methods to add (push) and remove (pop) data. The pop method usually doesn’t accept any arguments, as stacks by their nature follow LIFO: the only item you can remove is the one on top.

class Stack:
    def __init__(self):
        """Implement using a list"""
        self._stack = []
 
    def __repr__(self):
        return str(self._stack)
        
    def empty(self):
        """Check whether stack is empty"""
        return not self._stack

    def push(self, v):
        """Add an item to the top of the stack"""
        self._stack.append(v)

    def pop(self):
        """Remove and return the item at the top of the stack"""
        v = self._stack[-1]
        self._stack = self._stack[:-1]
        return v
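
A quick check that this behaves as a stack should:

s = Stack()
s.push(1)
s.push(2)

s          # [1, 2]
s.pop()    # 2 - the last item pushed comes off first
s.empty()  # False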

Python doesn’t offer a stack primitive, but the docs encourage the use of Python lists as stacks, as you can perform operations on them in a LIFO manner:

l = [1, 2, 3, 4]

# add to top of stack
l.append(5)

# pop from top of stack
l.pop()

# an empty list is falsy, so you can check for emptiness:
bool([])  # False
not []    # True

Queues

Again, Python doesn’t offer a primitive queue object, so I’ve implemented the basic protocol below using a list. By convention, the insert and delete operations on a queue are called enqueue and dequeue.

class Queue:
    def __init__(self):
        """Implement using a list"""
        self._queue = []
        
    def empty(self):
        """Check if queue is empty"""
        return not self._queue

    def enqueue(self, v):
        """Add an item to the back of the queue"""
        self._queue.append(v)

    def dequeue(self):
        """Remove and return the item at the front of the queue"""
        v = self._queue[0]
        self._queue = self._queue[1:]
        return v
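
And likewise for the queue:

q = Queue()
q.enqueue(1)
q.enqueue(2)

q.dequeue()  # 1 - the first item added leaves first
q.empty()    # False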

Although Python doesn’t offer queue or stack primitives, the queue module that ships with the standard library provides built-in FIFO/LIFO objects (queue.Queue and queue.LifoQueue) that fulfil both of the simple protocols above, with some extras. In general it is wise to use these rather than implement your own, as care has been taken to optimise their performance in comparison to the ones I’ve implemented above.
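
For instance, a minimal tour of the standard library versions:

from queue import LifoQueue, Queue

stack = LifoQueue()
stack.put(1)
stack.put(2)
stack.get()  # 2 - last in, first out

q = Queue()
q.put(1)
q.put(2)
q.get()      # 1 - first in, first out
q.empty()    # False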

What kind of problems can we use this technology for? Pretty much anything where you want to access data in a consistent, pre-defined order; the analysis below gives an example.

Analysis

The insert and delete operations for both stacks and queues are $O(1)$. This is a result of the fact that data access occurs in a pre-defined manner (LIFO/FIFO), so there is never any searching to do. (My list-backed versions above don’t quite achieve this, as slicing copies the list; that’s another reason to prefer the standard library implementations.) Therefore, when implementing an algorithm that relies on data access in some pre-defined order, one should reach for a stack or queue. For example, I made use of a queue to implement breadth first search, where I needed to store the next nodes to consider in order of consideration.
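
Here’s a sketch of that breadth first search use case, assuming the graph is stored as a dict mapping each node to a list of its neighbours, and using collections.deque as an efficient queue:

from collections import deque

def bfs(graph, start):
    """Return nodes in the order breadth first search visits them"""
    visited = {start}
    order = []
    queue = deque([start])

    while queue:
        node = queue.popleft()  # dequeue from the front (FIFO)
        order.append(node)

        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)  # enqueue at the back

    return order

bfs({'a': ['b', 'c'], 'b': ['d'], 'c': [], 'd': []}, 'a')
# ['a', 'b', 'c', 'd']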

Linked Lists and Dynamic Arrays

In most low level programming languages, you have to declare how much stuff can go in an array before putting anything in it, by setting aside some memory when you create the array object. In doing this, though, you quickly run into problems. Imagine you create an algorithm to process the results of some mathematical calculation, and each time you get a result you add it to an array. Setting your calculation off, you quickly eat up all the memory you’ve allocated to your initial array, and you have to create a new array object as part of your algorithm. This could quickly get messy, with references to new arrays being created dynamically. Two structures get around this problem by allowing you to extend the collection as you go: dynamic arrays, which resize their underlying storage as needed, and (singly) linked lists, which model each datum as being stored in a node, with a pointer to the next node.

This model is pretty good, though it is a little fiddly to deal with inserting and deleting data. The basic methods for a linked list, which I’ve implemented below, are again to do with the insertion and removal of data. (Note that I’ve given each node a pointer to the previous node as well as the next, which technically makes this a doubly linked list; it makes splicing nodes out much easier.)

class Node:
    """
    Node to store data, and pointer to next node in linked list
        as well as a pointer to the previous node.
    """
    def __init__(self, data):
        self._next = None
        self.prev = None
        self.data = data


class LinkedList:
    """Implement using Nodes and Pointers"""
    def __init__(self):
        self.head = Node(None)

    def __repr__(self):
        """Some Python sugar to print a representation"""

        x = self.head
        ll = []
        while x.data is not None:
            ll.append(x.data)
            x = x._next

        return str(ll)

    def insert(self, data):
        """Insert an element at the head of the list"""
        node = Node(data)

        node._next = self.head
        self.head.prev = node

        self.head = node
        self.head.prev = Node(None)

    def search(self, data):
        """Return the node that holds a given piece of data"""
        x = self.head

        # the chain ends in a sentinel whose _next is None, so a
        # search for missing data returns None rather than looping
        while x is not None and x.data != data:
            x = x._next

        return x

    def remove(self, data):
        """Splice out the node that holds a given piece of data"""

        x = self.search(data)
        if x is None:
            # nothing to remove
            return

        x.prev._next = x._next
        x._next.prev = x.prev

        if x.data == self.head.data:
            self.head = x._next
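
Using it looks like this:

ll = LinkedList()
for v in [1, 2, 3]:
    ll.insert(v)

ll            # [3, 2, 1] - insertion happens at the head
ll.search(2)  # the Node holding 2
ll.remove(3)
ll            # [2, 1]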

This is all well and good, but luckily you probably won’t ever have to fiddle around with an implementation of your own, as Python has you covered with the ever helpful list. Python lists are dynamic arrays, which you can (within reason) keep appending data to. They are fantastic, versatile objects, with many more features than my little linked list above. Specifically, my remove method fails to account for nodes with duplicate data, and my insert method can’t place a node anywhere but at the head of the list. Python lists let you insert items at any index, which is much more versatile than the basic (singly) linked list protocol above.
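
For instance, with the built-in list:

l = [1, 2, 4]

# insert by index: place 3 just before position 2
l.insert(2, 3)  # l is now [1, 2, 3, 4]

# remove the first occurrence of a value
l.remove(4)     # l is now [1, 2, 3]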

Analysis

Linked lists are ‘linear’: searching through them happens one node at a time. Therefore the search method, and hence the remove method, are both $O(N)$, whereas the insert method is much cheaper, as we always insert at the head of the list: it is $O(1)$. These data structures are therefore most useful when our algorithm needs to store data much more often than it needs to look it up.

Rooted Trees

We can use a similar node-and-pointer model to represent rooted trees; in the context of this post we’ll only consider binary trees and binary search trees. As with linked lists, we’ll assume each node contains some data; the remaining attributes are the pointers to other nodes, which will inevitably vary with the type of rooted tree.

Binary Trees

The nodes of a binary tree can point to at most two other nodes, hence binary. We refer to a node and the ones it points to as the ‘parent’ and its ‘children’, respectively. In terms of implementation, it’s a fairly common pattern I’ve seen elsewhere to specify binary (search) trees in an abstract manner, using just a Node object with pointers to a given node’s parent and its two children.

class Node:
    """Implement a tree by specifying relations between nodes."""
    def __init__(self, data):
        self.parent = None
        self.left_child = None
        self.right_child = None
        self.data = data

In terms of manipulating this structure to view/insert/delete data, we can implement common tree-traversal methods. Below, I’ve implemented a lookup method to find a given value in a binary (search) tree using a recursive (pre-order) traversal:

def lookup(node, value):
    """
    Lookup a given value in a binary tree, starting the search at
        a specified root node. Returns True if value is contained,
        within tree, False otherwise.
    """
    if node:
        if (
            node.data == value
            or lookup(node.left_child, value)
            or lookup(node.right_child, value)
        ):
            return True

    return False
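
For example, on a small hand-built tree:

root = Node(2)
root.left_child = Node(1)
root.right_child = Node(3)

lookup(root, 3)  # True
lookup(root, 5)  # False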

Binary Search Trees

Binary search trees have the additional property that all nodes to the left of a given node contain data of smaller or equal value to the datum in that node, and all nodes to its right contain data of greater value. For integers this ordering is obvious, but less so for things like strings, where you may have to map the data you’re storing back to unique integers (or define some other ordering). Using this property we can write a simple method to insert data into a binary search tree.

# construct a simple BST by hand; the empty Node(None) leaves
# act as slots for future inserts
n2 = Node(2)
n1 = Node(1)
n3 = Node(3)

n2.left_child = n1
n2.right_child = n3
n1.parent = n2
n3.parent = n2

n3.left_child = Node(None)
n3.right_child = Node(None)
n3.left_child.parent = n3
n3.right_child.parent = n3

n1.right_child = Node(None)
n1.left_child = Node(None)
n1.left_child.parent = n1
n1.right_child.parent = n1

 
def insert(node, value):
    """
    Insert a value into a binary search tree.
        Recurse down the tree, going left for smaller or equal
        values and right for greater ones, until we reach an
        empty leaf node in which to store the value.
    """
    if node.data is None:
        # found an empty leaf: store the value here, and attach
        # fresh empty leaves so later inserts can recurse past it
        node.data = value
        node.left_child = Node(None)
        node.right_child = Node(None)
        node.left_child.parent = node
        node.right_child.parent = node
    elif value <= node.data:
        insert(node.left_child, value)
    else:
        insert(node.right_child, value)
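
As a quick usage check, we can grow the tree built above and confirm the ordering property holds:

insert(n2, 4)        # 4 > 2 and 4 > 3: stored in n3's right leaf
n3.right_child.data  # 4

lookup(n2, 4)  # True
lookup(n2, 5)  # False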

I’ve used a similar recursive strategy to implement both insert and lookup above. This is because recursion pops out fairly naturally when manipulating trees, mostly because we’ll want to consider the nodes of a tree one at a time, often based on a condition, while the data structure itself is clearly non-linear.

Python doesn’t offer any tree primitives, so are they really that useful? I think a lot of people underestimate the importance of trees. Famously, the author of Homebrew complained about being rejected by Google for ‘failing to invert a binary tree’, which he thought was BS given he had created software that pretty much all Mac coders use. Initially I was inclined to side with him, but I read a blog post that really changed my mind on the issue.

So why doesn’t Python implement tree primitives, despite trees being so important in Computer Science? In a sense it does: nested abstractions such as a Python dictionary of dictionaries, or a JSON object, are tree-like structures, with each level branching into the next. Much like my implementation of a linked list earlier, Python’s creators evidently thought it more useful, in terms of solving problems, to wrap the basic abstraction inside a higher level concept, which, when you think about it, is fair. Do you really want to implement a tree traversal method every time you want to look up an item in a dict-like structure?
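
As a sketch of that idea, a nested dictionary can be read as a rooted tree, with key lookups playing the role of following child pointers:

# each nested dict is a subtree; keys act as edges to child nodes
tree = {
    'animals': {'cats': {}, 'dogs': {}},
    'plants': {'trees': {}},
}

# following keys downwards is a root-to-leaf traversal
tree['animals']['cats']  # {}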

Analysis

My recursive tree methods above operate on a constructed tree in place, so aside from the recursion stack they require no extra space. The insert method follows a single path from the root down to a leaf, so its running time is proportional to the number of levels in the tree: $O(\log N)$ for a balanced tree, by a similar argument to that of merge sort. But as a binary tree can degenerate into a linked list, this becomes $O(N)$ in the worst case. Note that the lookup method above doesn’t use the ordering property to discard subtrees, so it may visit every node and is $O(N)$; recursing only into the relevant subtree at each step would bring it down to $O(\log N)$ too.

Hash Tables

Hash tables are the most effective way of implementing the abstraction of a dictionary-like object, i.e. one where we want to read, update, and insert data by reference to some name (a key). In theory unique hashes are difficult to guarantee (search for ‘hash collisions’), but in practice most hashing algorithms are sound enough to make collisions very rare indeed. As long as collisions are rare, lookup can be very efficient: $O(1)$.

I’ve built a fairly contrived implementation below, using Python lists as the underlying data structure. Each slot in the table holds a list (a ‘bucket’) of (key, item) pairs, so that if two keys hash to the same slot, both can still be stored and told apart.

TABLE_SIZE = 100

def small_hash(key):
    """Map a key to one of TABLE_SIZE slots"""
    return hash(key) % TABLE_SIZE


class HashTable:
    """Simple hash table implementation using Python lists"""

    def __init__(self):
        # one (initially empty) slot per possible small_hash value
        self._table = [None] * TABLE_SIZE
        self._keys = []

    def __repr__(self):
        return str([(k, self[k]) for k in self._keys])

    def __setitem__(self, key, item):
        slot = small_hash(key)

        if key not in self._keys:
            self._keys.append(key)

        if self._table[slot] is None:
            self._table[slot] = []

        # chaining: keep (key, item) pairs in the bucket, so keys
        # that collide don't overwrite each other
        bucket = [(k, v) for k, v in self._table[slot] if k != key]
        bucket.append((key, item))
        self._table[slot] = bucket

    def __getitem__(self, key):
        bucket = self._table[small_hash(key)] or []

        for k, v in bucket:
            if k == key:
                return v

        raise KeyError(key)
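
A quick sanity check of the protocol:

table = HashTable()
table['spam'] = 1
table['eggs'] = 2
table['spam'] = 3  # setting an existing key overwrites its item

table['spam']  # 3
table          # [('spam', 3), ('eggs', 2)]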

In reality you would never bother implementing the above in Python, as dictionaries are primitives in the language, as they are in many other languages, both interpreted and compiled.

Analysis

If your hash function is decent, you should be able to perform lookups and inserts in $O(1)$ on average.

Conclusion

Data structures are a very deep topic, and I’ve only scratched the surface here with simple implementations of fairly simple data structures. For example, in the hash table I’ve implemented, I’ve not optimised the hash function to reduce collisions, or really considered the type of problem I’d wish to solve (and hence the underlying data structure the table should rely on). However, it’s a good start for anyone interested in studying the topic, and CLRS is really the place to supplement this post.