B+ Tree vs Hash Index
Indexes are the fundamental unit of database performance. There are many index types, but the two most common are the B+ Tree (probably the most common) and the hash index.
Intro to each index type
B+ Tree index
A B+ Tree is a tree data structure with some interesting characteristics that make it great for fast lookups with relatively few disk IOs.
- A B+ Tree can (and should) have many more than 2 children per node.
- A B+ Tree is self balancing.
- A B+ Tree holds keys in internal nodes, but only holds values in the leaf nodes (the nodes on the bottom).
- A B+ Tree is sorted (values ascend from left leaf node to right leaf node).
Here is a great tool for visualizing what a B+ Tree does!
Hash index
A hash index takes the key of the value you’re indexing and hashes it into buckets. In case you’re not aware, a hash function is something that takes an input, and produces a different, somewhat random, and hopefully unique, output.
- “Somewhat random” - Sometimes you actually want the hash function to place certain things somewhat close together on disk, because they are frequently read together. In this instance, you would want to use a “location sensitive hash function”.
- “Hopefully unique” - Sometimes two different inputs produce the same output. This is called a hash collision. In this case, the database usually uses a different hash function to resolve the collision. Hashing these values allows you to do 0(1) equality lookups based on the value of the indexed column. All you have to do to find the exact location of the rows with that value is hash the value you’re trying to find, and look in the place where the hash value leads you!
When to use each type
- B+ Tree
- Is usually the default index type
- If you don’t know what to use, just use a B+ Tree
- Excellent for integer and date columns, where you will frequently compare values.
- Is very fast for “range style” queries.
- B+ Trees store links between each leaf node, essentially creating a sorted doubly linked list!
- Lookup time:
log f (N/F)
disk IOsf
- The “fanout” of your B+ Tree, or how many children each node has.N
- The number of disk pages needed to hold the indexed keysF
- The “slack” of our tree. Usually 2/3.- Slack, in this instance, is how many nodes we build with empty values. Keeping empty values in the tree allows us to do inserts without moving a ton of data around, and keeps the overhead of maintaining the index lower.
- Example
- Let’s say you’re indexing 1 trillion integers (8 bytes each), or 800 billion bytes (800 GB).
- Assume your DB blocks are 64 MB.
N
is 12500 pages (800 GB / 64 MB) to hold indexed values.f
of 5460F
of 2/3- Applying the formula above, plus one extra disk IO to find the root node, comes out to (drum roll please)… 3 disk IOs to look up a single value in 800 GB of data! Wowza!
- The Hash index
- Is great for exact equality operations.
- This makes it ideal for UUID and other text columns.
- Lookup time:
O(1)
. Its hard to beat that!
- Is great for exact equality operations.