Full credit to googleguy, who can be found on Freenode IRC channel ##php (sometimes).
Imagine you’re a phone company and you want to implement an automated system on your website for assigning phone numbers for your customers. There are two use cases. One, is that the customer can ask if a specific number is available. The other, is that the customer can ask for any random number that’s available. What data structure would you use to efficiently store and search these phone numbers? Keep in mind that you could potentially have billions of these phone numbers (1011 if we include international numbers).
If you use an Array
, access is O(1)
so it seems like a good idea at first, but that’s only in cases where you know the index of the value you want to access. Finding that index means searching the entire array, which is really O(n)
. If you use a Hash Table
, the cost of search is O(1)
so it seems like a better idea, but the cost of deletion and insertion can be O(n)
in the worst case scenario (where you have 100% collision).
The best answer is to use a BST or Binary Search Tree, because they have average O(log n)
for search, insertion, and deletion. That is assuming it’s a self-balancing tree implementation.
If you’re not at all familiar with Binary Search Trees or Binary Trees in general, let me give you a simple analogy to try and grasp this particular concept of search in computer science.
Let’s say a friend of yours is stuck on the road due to car trouble. Like a good friend you go to pick them up. You know they are stuck at a fork on this road and you arrive at a fork. At the fork there is a sign with a number on it. The number is 5 and your friend sees a sign with the number 16. You know that if you go left, you will only find numbers less than 5 and if you go right you will find numbers greater than 5. So you go right and arrive at another fork. This time you see a sign with the number 11. What do you do?
Knowing how to get to your friend, on this road, is no different than knowing how to search a (BST). You must always know two things, and always be able to make one of two simple choices. Is this number more than or less than the number I’m searching for? Do I go left, or do I go right at the fork?
Learning Opportunity
A BST is a data structure made up of nodes (the forks in our road analogy) which have left and right branches that, may or may not, each have an attached node (i.e there could be a dead end). A node attached to the left branch is less than the current node. A node attached to the right branch is greater than the current node. This always holds true no matter which node you’re currently looking at. The entirety of the structure makes up what looks like a tree, given all these nodes connected by branches. The fact that there can only be two branches on each node is what makes it binary. In the context of search these distinct properties are what make a BST very efficient.
The idea here is that you can always discard one-half of the tree with every iteration. This makes it possible to search for any value in the tree logarithmically, i.e. O(log n)
. Consequently, the cost of insertion or deletion to and from the tree are the same.
So say we have 9 random integers in a BST and we want to find out if the integer 16 is one of them. You always start at the root node of the tree and compare the given value to the search value. If the node’s value is greater than the search value we move to the node in the left branch. If the node’s value is less than the search value we move to the node in the right branch. Repeat this step until you find the value or reach the end (a null branch).
In PHP, we know that arrays are really just ordered hashmaps, and the cost of search in a hashmap is O(1)
, which makes it much faster than O(log n)
. However, BSTs have other properties that make them more desirable than a hashmap in many use cases. For example, finding the min or max value in a hashmap is O(n)
whereas in a BST it is O(log n)
. In a real vector, or contiguous block, the values would need to be sorted before search can be O(1)
, but Quicksort requires O(n log n)
making it very inefficient for insertion and deletion, because you have to resort each time you insert or delete.
The best way to learn is to just roll up your sleeves and do it, right? So let’s implement a BST in PHP and learn something!
We already know that a binary search tree is made up of nodes. The properties that best describe these nodes is that they have a value
, a left
branch, and a right
branch. Each branch can also contain a node. So let’s write up the node
class first since it’s simply described by a set of properties and doesn’t have any real behavior (i.e. methods).
class Node
{
public $value, $left, $right;
public function __construct($value)
{
$this->value = $value;
}
}
Simple enough. Now that we have a way to create a node, we’ll need the actual binary search tree implementation to be able to create the tree. So far we understand that a BST has a root
node, so that will be one of the properties that describe this object.
The BST also has a couple of behaviors, that will make up the methods of this class. We’ll start by implementing the simplest of these behaviors, which is search
.
class BST
{
public $root;
public function __construct($value = null)
{
if ($value !== null) {
$this->root = new Node($value);
}
}
public function search($value)
{
$node = $this->root;
while($node) {
if ($value > $node->value) {
$node = $node->right;
} elseif ($value < $node->value) {
$node = $node->left;
} else {
break;
}
}
return $node;
}
}
As you can see search is quite simple. We merely compare the search value to the current node (starting at the root node). If it’s greater than the value of the current node, we go to the node in the right branch. If it’s less than the value of the current node we go to the node in the left branch. If they’re equal we return the current node. This repeats in the loop until we reach a branch that is null
. This tells us that the value does not exist in the tree.
Now let’s implement the insert
functionality of the tree in order to make it really useful. This is very similar to our search behavior, with one difference. We’re looking for a null branch to insert the value. So ideally, our insert function should look fairly similar to our search function with the minor difference being that we will attach a new node to the null branch reached at the end of our loop.
public function insert($value)
{
$node = $this->root;
if (!$node) {
return $this->root = new Node($value);
}
while($node) {
if ($value > $node->value) {
if ($node->right) {
$node = $node->right;
} else {
$node = $node->right = new Node($value);
break;
}
} elseif ($value < $node->value) {
if ($node->left) {
$node = $node->left;
} else {
$node = $node->left = new Node($value);
break;
}
} else {
break;
}
}
return $node;
}
OK, so what’s going on here? Every time we go right or left we check to see if the branch is null
before proceeding and based on that information we make a decision to either attach a new node to the branch and break from the loop, or move to the next node. So insertion happens just before the point where we would normally end up with null
and break from the loop, but since we can’t attach a node to null
, we attach just before the $node
is assigned null, and then break.
Before we can learn about deleting nodes from the tree we must first grasp an important concept of binary search trees, which is how to find their minimum and maximum values. Remember that every time you go left in a BST you are guaranteed to get a value smaller than its parent. So if we start at the root node, and keep going left until we arrive at a node with a null
left branch, we should have reached the smallest value in the tree.
So our implementation of min
in the BST class should look something like this.
public function min()
{
if (!$this->root) {
throw new Exception('Tree root is empty!');
}
$node = $this->root;
return $node->min();
}
Notice that BST::min()
will only call on root->min()
so we must implement the actual functionality in our Node
class. This makes sense since we will likely find it useful to find the minimum value from a specific branch in the tree and not just from the root.
So our implementation of min
in the Node class should look something like this.
public function min()
{
$node = $this;
while ($node->left) {
if (!$node->left) {
break;
}
$node = $node->left;
}
return $node;
}
Notice that if you start at the root and keep going left in our example tree illustrated earlier, you will ultimately reach the smallest value in the tree, but consequently, if you were to start at any other node in the tree and keep going left until you arrive at a null left branch, you will have the smallest value in that branch.
You can probably guess how the implementation for max
will work in our BST class…
public function max()
{
if (!$this->root) {
throw new Exception('Tree root is empty!');
}
$node = $this->root;
return $node->max();
}
In our Node class the same concept applies, but we’re going right instead of left.
public function max()
{
$node = $this;
while ($node->right) {
if (!$node->right) {
break;
}
$node = $node->right;
}
return $node;
}
Great, now we have a way to insert nodes into the tree and a way to search the tree. Next, we need to have a way to delete nodes from the tree. Deletion is slightly trickier than search and insertion, but that’s OK, because we’re super ninja programmers that can code our way out of any problem, right?
Crickets…
There are three kinds of deletion in a binary search tree we should be aware of, because they’re each handled just a little differently.
- Deleting a node that has no children
- Deleting a node that has one child node
- Deleting a node that has two child nodes
The easiest is the first one with no nodes, obviously. You simply just delete the node by assigning null
to its parent branch, which means you must keep track of the parent branch. So in our earlier BST, if we decided to delete the node with the value of 1
, its parent node 2
, would then have a left branch of null
and the node would be removed from the tree.
Now, let’s say we want to delete the node with the value 2
, which at this point only has one branch (containing the node with the value 4
). It doesn’t actually matter if the child’s branch is right or left. As long as it only has one branch we simply take the node in that branch and attach it to the parent branch of the node we’re about to delete.
Finally, the last case is when we wish to delete a node that has both branches occupied. To do this we must first find the smallest node of its right branch. We can do this using the node’s min method that we implemented earlier.
Now we assign that min value to value of the node we’re trying to delete, which will give us duplicate nodes.
Next we apply a delete operation to the duplicate node.
Now, to make this concept a little easier to implement we can make one small modification to our Node class. By allowing each node to keep track of its parent node, we never need to pass along the parent by keeping track of it from the calling context. This will simplify the last and most complex case of deletion for us quite a bit. It will also make it easier to walk the tree later.
class Node
{
public $parent, $value, $left, $right;
public function __construct($value, Node $parent = null)
{
$this->value = $value;
$this->parent = $parent;
}
}
Now we just need to modify the insert
method of our BST class to pass the parent into the Node’s constructor when doing insertion.
What this gives us is a doubly-linked list of nodes. Making it possible for a child node, in the tree, to know about its parent node. This is also a recursive relationship, because the child stores the parent property as the parent node object itself. So, be careful of (recursion there) recursion there.
public function insert($value)
{
$node = $this->root;
if (!$node) {
return $this->root = new Node($value);
}
while($node) {
if ($value > $node->value) {
if ($node->right) {
$node = $node->right;
} else {
$node = $node->right = new Node($value, $node);
break;
}
} elseif ($value < $node->value) {
if ($node->left) {
$node = $node->left;
} else {
$node = $node->left = new Node($value, $node);
break;
}
} else {
break;
}
}
return $node;
}
The root
will have a null
parent, and that’s fine. So nothing else should change in our implementation. We’re now ready to implement delete in our BST class.
public function delete($value)
{
$node = $this->search($value);
if ($node) {
$node->delete();
}
}
The delete
method in our BST class is rather simple. It just searches for the node we’re trying to delete, and calls the delete
method on that Node. The Node should be able to take care of the rest from here.
public function delete()
{
if ($this->left && $this->right) {
$min = $this->right->min();
$this->value = $min->value;
$min->delete();
} elseif ($this->right) {
if ($this->parent->left === $this) {
$this->parent->left = $this->right;
$this->right->parent = $this->parent->left;
} elseif ($this->parent->right === $this) {
$this->parent->right = $this->right;
$this->right->parent = $this->parent->right;
}
$this->parent = null;
$this->right = null;
} elseif ($this->left) {
if ($this->parent->left === $this) {
$this->parent->left = $this->left;
$this->left->parent = $this->parent->left;
} elseif ($this->parent->right === $this) {
$this->parent->right = $this->left;
$this->left->parent = $this->parent->right;
}
$this->parent = null;
$this->left = null;
} else {
if ($this->parent->right === $this) {
$this->parent->right = null;
} elseif ($this->parent->left === $this) {
$this->parent->left = null;
}
$this->parent = null;
}
}
Our Node::delete()
, on the other hand, is not so simple.
The first if
condition, in this method, is the number 3 scenario on our list, mentioned earlier. When the node we’re trying to delete has both a left and right branch, we must first find the min
of its right branch and set its value to the value of that node. Then we call delete on that node. Notice, once we do that the min
node cannot meet the first condition again. It will only fall into one of the other two scenarios—it either has one child or no children at all (because it will never have a left branch).
The animation here, highlights the most complex of deletion cases, where we need to perform a delete on a node that has both left and right branches occupied.
Now, we’ll look at how you can walk through the binary search tree and some of the different ways you can do it.
One thing you should note about tree structures like the binary search tree is that they are not linear structures, like vectors, or linked lists. They can not be viewed sequentially. They are actually traversed as graphs, which means there is more than one way to traverse them. However, there must be a systematic way in which you can visit every node in the tree exactly once. Otherwise, the implementation is considered inefficient and potentially defective. Imagine you wanted to print all of the node values in the tree and your method allowed some nodes to be visited more than once. This wouldn’t be very helpful.
There are two conceptual ways in which you can traverse a binary tree. One way is iterative, meaning it works sequentially (one after the other), but requires a little more memory. The other way is to do it recursively, and since a binary tree is by definition is a corecursive structure, that would make sense. Doing it iteratively requires storing or deferring some nodes to memory before traversing them. This is usually accomplished by using a stack or queue data structure. I actually find it easier to implement a walk
method on a tree recursively, but that’s just me. Then again, I’m the one writing this so, what do I care how you like to do it?!
Now whether the method of implementation is recursive or iterative, you still have to choose the order of traversal on the tree, which can be either Depth First or Breadth First. Breadth First order is also sometimes referred to as level order, because you visit every node on a particular level of the tree before moving down to the next level.
In Depth First order there are three different algorithms commonly used for traversal.
- Pre-order
- In-order
- Post-order
We’re not going to go through all of these approaches here. In this article, we’re just going to use Depth First In-order traversal, since I find it to make the most sense for demonstration purposes.
First, let’s plot out our chartered course so that I have some high-level view of where we want to go, before we try to get there.
The sequence looks kind of like a roller coaster ride, but this particular aproach will give us the nodes in order.
So the result is all values in ascending order as [1, 2, 4, 5, 7, 11, 16, 23, 34]
, which is what you’d expect if this was a sorted array, for example.
public function walk(Node $node = null)
{
if (!$node) {
$node = $this->root;
}
if (!$node) {
return false;
}
if ($node->left) {
yield from $this->walk($node->left);
}
yield $node;
if ($node->right) {
yield from $this->walk($node->right);
}
}
I’ve used a generator implementation, which will only work in PHP 7, but that’s just because it makes the implementation so much simpler. You can accomplish the same thing in PHP 5 by replacing the yield
calls with an array assignment, for example, and then using array_reduce
to return the actual results back to the caller. You might need a wrapper function for this, but the concept is the same. It’s just so much less code using a generator!
If we ran this code…
$bst = new BST(5);
$bst->insert(2);
$bst->insert(1);
$bst->insert(4);
$bst->insert(11);
$bst->insert(7);
$bst->insert(23);
$bst->insert(16);
$bst->insert(34);
$tree = $bst->walk();
foreach($tree as $node) {
echo "{$node->value}\n";
}
We should get output like this…
1
2
4
5
7
11
16
23
34
So, you’re probably thinking, what on earth would I need a binary tree for, and especially in a language like PHP? And you’re probably very right for thinking that, but don’t confuse what you’ll need it for with why you almost never need to implement a binary tree in practice. In PHP, particularly, the performance gains of binary search are dwarfed by the context-switching overhead of having to implement them as objects (it takes a lot longer to carry out a function call in PHP than it would to do simple pointer dereferencing in a language like C, for example).
It’s very important to know about and understand this data structure in concert with other data structures so that you’re aware of the performance (time and memory) characteristics of each in a given problem. In many cases, you’ll be using tools that implement a btree quite frequently, but will rarely be required to implement one yourself. For example, your Database Management System (DBMS) implements a type of btree commonly referred to as B+ tree. In 3D computer graphics, many games will often implement a kind of btree usually called a Binary Space Parition (BSP), which allows for recursively subdividing a space into convex sets by hyperplanes.
There are many areas where trees or graphs (nodes and edges) are generally useful in a wide array of programming problems. So having a good grasp of BST, is not a bad start!
As mentioned in the beginning of this article, only a self-balancing implementation of a BST, such as AVL-tree will accomplish average O(log n)
in search, insertion, and deletion, because as you can see by our implementation if I were to simply insert the numbers [1,2,3,4,5]
or [5,4,3,2,1]
you would have O(n)
since the tree would be entirely right or left heavy.
So, as always, I like to leave something up to the imagination. So I’ll leave it to you to play around with the implementation and see if you can figure out how to implement a balance
method that will automatically balance the tree.