Discovering the Computer Science Behind Postgres Indexes

Development

Reading Time: 13 minutes

This is the last in a series of Postgres posts that Pat Shaughnessy wrote based on his presentation at the Barcelona Ruby Conference. You can also watch the video recording of the presentation. The series was originally published on his personal blog, and we are republishing it on Codeship with his kind permission. You can also read posts one, two, and three in the series.

engineering-plans
Captain Nemo and Professor Aronnax
discussing the engineering and
behind The Nautilus.

We all know indexes are one of the most powerful and important features of relational database servers. How do you search for a value quickly? Create an index. What do you have to remember to do when joining two tables together? Create an index. How do you speed up a SQL statement that’s beginning to run slowly? Create an index.

But what are indexes, exactly? And how do they speed up our database searches?

To find out, I decided to read the C source code inside the PostgreSQL database server, to follow along as it searched an index for a simple string value. I expected to find sophisticated algorithms and efficient data structures. And I did.

Today I will show you what indexes look like inside Postgres and explain how they work. What I didn’t expect to find — what I discovered for the first time reading the Postgres C source code — was the Computer Science theory behind what it was doing. Reading the Postgres source was like going back to school and taking that class I never had time for when I was younger. The C comments inside Postgres not only explain what Postgres does, but why.

Sequence Scans: A Mindless Search

When we left the crew of the Nautilus, they were exhausted and beginning to faint: the Postgres sequence scan algorithm was mindlessly looping over all of the records in the users table!

Recall in my last post we had executed this simple SQL statement to find Captain Nemo: select-users

Postgres first parsed, analyzed and planned the query. Then ExecSeqScan, the C function inside of Postgres that implements the sequence scan (SEQSCAN) plan node, quickly found Captain Nemo:

seqscan4

But then inexplicably Postgres continued to loop through the entire user table, comparing each name to “Captain Nemo,” even though we had already found what we were looking for!

seqscan5

Imagine if our users table had millions of records; this could take a very long time. Of course, we could have avoided this by removing the sort and rewriting our query to accept the first name, but the deeper problem here is the inefficient way Postgres searches for our target string.

Using a sequence scan to compare every single value in the users table with “Captain Nemo” is slow, inefficient and depends on the random order the names appear in the table. What are we doing wrong? There must be a better way!

The answer is simple: We forgot to create an Index. Let’s do that now.

Creating an Index

Creating an index is straightforward — we just need to run this command:

create-index

As Ruby developers, of course, we would use the add_index ActiveRecord migration instead; this would run the same CREATE INDEX command behind the scenes. When we rerun our select statement, Postgres will create a plan tree as usual — but this time the plan tree will be slightly different:

indexscan-plan

Notice at the bottom Postgres now uses INDEXSCAN instead of SEQSCAN. Unlike SEQSCAN, INDEXSCAN won’t iterate over the entire users table. Instead, it will use the index we just created to find and return the Captain Nemo records quickly and efficiently.

Creating an index has solved our performance problem, but it’s also left us with many interesting, unanswered questions:

  • What is a Postgres index, exactly?
  • If I could go inside of a Postgres database and take a close look at an index, what would it look like?
  • And how does an index speed up searches?

Let’s try to answer these questions by reading the Postgres C source code.

What Is a Postgres Index, Exactly?

We can get started with a look at the documentation for the CREATE INDEX command.

documentation

Here you can see all of the options we can use to create an index, such as UNIQUE and CONCURRENTLY. Notice there’s an option called USING method. This tells Postgres what kind of index we want. Farther down the same page is some information about method, the argument to the USING keyword:

method

It turns out Postgres implements four different types of indexes. You can use them for different types of data and in different situations. Because we didn’t specify USING at all, our index_users_on_name index is a “btree” (or B-Tree) index, the default type.

This is our first clue: a Postgres index is a B-Tree. But what is a B-Tree? Where can we find one? Inside of Postgres, of course! Let’s search the Postgres C source code for files containing “btree:”

find

The key result is in bold: “./backend/access/nbtree.” Inside this directory is a README file; let’s read it:

readme

atlantis
Nemo found the lost continent
of Atlantis next to an
underwater volcano.

Amazingly, this README file turns out to be an extensive 12-page document! The Postgres source code not only contains helpful and interesting C comments, it also contains documentation about the theory and implementation of the database server.

Reading and understanding the code in open source projects can often be intimidating and difficult, but not for Postgres. The developers behind Postgres have gone to great lengths to help the rest of us understand their work.

The title of the README document, “Btree Indexing,” confirms this directory contains the C code that implements Postgres B-Tree indexes. But the first sentence is even more interesting: it’s a reference to an academic paper that explains what a B-Tree is, and how Postgres indexes work: Efficient Locking for Concurrent Operations on B-Trees, by Lehman and Yao.

We’ll find our first look at a B-Tree inside this academic paper.

What Does a B-Tree Index Look Like?

Lehman and Yao’s paper explains an innovation they made to the B-Tree algorithm in 1981. I’ll discuss this a bit later. But they start with a simple introduction to the B-Tree data structure, which was actually invented 9 years earlier in 1972. One of their diagrams shows an example of a simple B-Tree:

figure2

The term B-Tree actually stands for “balanced tree.” B-Trees make searching easy and fast. For example, if we wanted to search for the value 53 in this example, we first start at the root node which contains the value 40:

node1

We compare our target value of 53 with the value we find in the tree node. Is 53 greater than or less than 40? Because 53 is greater than 40, we follow the pointer down to the right. If we were searching for 29, we would go down to the left. Pointers on the right lead to larger values; pointers on the left to smaller ones.

Following the pointer down the tree to the next child tree node, we encounter a node that contains 2 values:

node2

This time we compare 53 with both 47 and 62, and find that 47 < 53 < 62. Note the values in the tree node are sorted, so this will be easy to do. This time we follow the center pointer down. Now we get to another tree node, this one with 3 values in it:

node3

Looking through the sorted list of numbers, we find 51 < 53 < 56, and follow the second of four pointers down. Finally, we come to a leaf node in the tree:

node4

And we’ve found the value 53!

B-Trees speed up searches because:

  • They sort the values (known as keys) inside of each node.
  • They are balanced: B-Trees evenly distribute the keys among the nodes, minimizing the number of times we have to follow a pointer from one node to another. Each pointer leads to a child node that contains more or less the same number of keys each other child node does.

Sign up for a free Codeship Account

What Does a Postgres Index Look Like?

Lehman and Yao drew this diagram over 30 years ago — what does it have to do with how Postgres works today? Astonishingly, the index_users_on_name index we created earlier looks very similar to Figure 2: We created an index in 2014 that looks just like a diagram from 1981!

When we executed the CREATE INDEX command, Postgres saved all of the names from our users table into a B-Tree. These became the keys of the tree. Here’s what a node inside a Postgres B-Tree index looks like:

index-tuple-data1

Each entry in the index consists of a C structure called IndexTupleData and is followed by a bitmap and a value. Postgres uses the bitmap to record whether any of the index attributes in a key are NULL, to save space. The actual values in the index appear after the bitmap.

Let’s take a closer look at the IndexTupleData structures:

index-tuple-data2

Above you can see each IndexTupleData structure contains:

  • t_tid: This is a pointer to either another index tuple, or to a database record. Note this isn’t a C pointer to physical memory; instead, it contains numbers Postgres can use to find the referenced value among its memory pages.
  • t_info: This contains information about the index tuple, such as how many values it contains, and whether or not there are null values.

To understand this better, let’s show a few entries from our index_users_on_name index:

index-tuple-data3

Now I’ve replaced “value” with some names from my users table. The upper tree node includes the keys “Dr. Edna Kunde” and “Julius Powlowski,” while the lower tree node contains “Julius Powlowski” and “Juston Quitzon.”

Notice that, unlike Lehman and Yao’s diagram, Postgres repeats the parent keys in each child node. Here “Julius Powlowski” is a key in the upper node and in the child node. The t_tid pointer from Julius in the upper node references the same Julius name in the lower node.

To learn more about exactly how Postgres stores key values into a B-Tree node, refer to the itup.h C header file:

IndexTupleData

Finding the B-Tree Node Containing Captain Nemo

Now let’s return to our original SELECT statement again:

root-names1

How exactly does Postgres search our index_users_on_name index for “Captain Nemo?” Why is using the index faster than the sequence scan we saw in my last post? To find out, let’s zoom out a bit and take a look at some of the user names in our index:

This is the root node of the index_users_on_name B-Tree. I’ve turned the tree on its side so the names would fit. You can see 4 names and a NULL value. Postgres created this root node when I created index_users_on_name.

Note that, aside from the first NULL value which represents the beginning of the index, the other 4 names are more or less evenly distributed in alphabetical order.

Remember that a B-Tree is a balanced tree. In this example, the B-Tree has 5 child nodes:

  • the names that appear before Dr. Edna Kunde alphabetically
  • names that appear between Dr. Edna Kunde and Julius Powlowski
  • names that appear between Julius Powlowski and Monte Nicolas
  • etc…

Because we’re searching for Captain Nemo, Postgres follows the first, top arrow to the right. This is because Captain Nemo comes before Dr. Edna Kunde alphabetically:

root-names2

You can see on the right that Postgres has found the B-Tree node that contains Captain Nemo. For my test I added 1000 names to the users table; this child node in the B-Tree contained about 200 names (240 actually). The B-Tree has narrowed down Postgres’s search considerably.

To learn more about the precise algorithm Postgres uses to search for the target B-Tree node among all of the nodes in the tree, read the _bt_search function.

_bt_search

Finding Captain Nemo Inside a Single B-Tree Node

Now that Postgres has narrowed down the search to a B-Tree node containing about 200 names, it still has to find Captain Nemo… how does it do this? Does it perform a sequence scan on this shorter list?

binary-search1

No. To search for a key value inside of a tree node, Postgres switches to use a binary search algorithm. It starts by comparing the key that appears at the 50% position in the tree node with “Captain Nemo:”

binary-search2

Because Captain Nemo comes after Breana Witting alphabetically, Postgres jumps down to the 75% position and performs another comparison:

binary-search3

This time Captain Nemo comes before Curtis Wolf, so Postgres jumps back a bit. Skipping a few more steps (it actually took Postgres 8 comparisons to find Captain Nemo in my example), Postgres eventually finds what we are looking for:

binary-search4

To learn more about exactly how Postgres searches for a value in a single B-Tree node, read the _bt_binsrch function:

bt_binsirch

So Much More to Learn

I don’t have space in this blog post to cover many other fascinating details about B-Trees, database indexes or Postgres internals… maybe I should write Postgres Under a Microscope. But for now, here are just a few interesting bits of theory you can read about in Efficient Locking for Concurrent Operations on B-Trees or in the other academic papers it references.

  • Inserting into B-Trees: The most beautiful part of the B-Tree algorithm has to do with inserting new keys into a tree. Key are inserted in sorted order into the proper tree node — but what happens when there’s no more room for a new key? In this situation, Postgres splits the node into two, inserts the new key into one of them, and also adds the key from the split point into the parent node, along with a pointer to the new child node. Of course, the parent node might also have to be split to fit its new key, resulting in a complex, recursive operation.
  • Deleting from B-Trees: The converse operation is also interesting. When deleting a key from a node, Postgres will combine sibling nodes together when possible, removing a key from their parent. This can also be a recursive operation.
  • B-Link-Trees: Lehman and Yao’s paper actually discusses an innovation they researched related to concurrency and locking when multiple threads are using the same B-Tree. Remember, Postgres’s code and algorithms need to be multithreaded because many clients could be searching or modifying the same index at the same time. By adding another pointer from each B-Tree node to the next sibling node — the so-called “right arrow” — one thread can search a tree even while a second thread is splitting a node without locking the entire index:

right-arrow

Don’t Be Afraid To Explore Beneath The Surface

at-the-helm
Captain Nemo at the helm

Professor Aronnax risked his life and career to find the elusive Nautilus and to join Captain Nemo on a long series of amazing underwater adventures.

We should do the same: Don’t be afraid to dive underwater — inside and underneath the tools, languages and technologies that you use every day. You may know all about how to use Postgres, but do you really know how Postgres itself works internally?

Take a look inside; before you know it, you’ll be on an underwater adventure of your own.

Studying the Computer Science at work behind the scenes of our applications isn’t just a matter of having fun, it’s part of being a good developer.

As software development tools improve year after year, and as building web sites and mobile apps becomes easier and easier, we shouldn’t lose sight of the Computer Science we depend on. We’re all standing on the shoulders of giants — people like Lehman and Yao, and the open source developers who used their theories to build Postgres.

Don’t take the tools you use everyday for granted — take a look inside them! You’ll become a wiser developer and you’ll find insights and knowledge you could never have imagined before.

Discuss this article on HackerNews: https://news.ycombinator.com/item?id=9759560

Subscribe via Email

Over 60,000 people from companies like Netflix, Apple, Spotify and O'Reilly are reading our articles.
Subscribe to receive a weekly newsletter with articles around Continuous Integration, Docker, and software development best practices.



We promise that we won't spam you. You can unsubscribe any time.

Join the Discussion

Leave us some comments on what you think about this topic or if you like to add something.

  • Pingback: Bookmarks for June 22nd | Chris's Digital Detritus()

  • Sean McCleary

    “The term B-Tree actually stands for “balanced tree.””

    B-tree means “binary tree” and are not inherently balanced. There are plenty
    of specific types of self balancing tree algorithms, but I believe the
    default implementation of b-trees by definition are not self-balanced. https://en.wikipedia.org/?title=B-tree

  • Sigi Kiermayer

    Thank you! Was a nice read :-)

  • Torsten Grust

    Hi. Nice article.

    According to Rudolf Bayer, co-inventor of the B-tree, the “B” in “B-tree” does not stand for “balanced”. It also does not stand for “Bayer” or “Boeing” (his affiliation when he and McCreight invented the B-tree). Many times we’ve tried to lure Rudolf into revealing the secret, but he just wouldn’t say. ;-)

    • patshaughnessy

      Oh ok, thanks!

      I don’t remember exactly where, but last year while researching for this I saw the term “B-tree” referred to as a “balanced tree” in an academic journal article. If you think it’s important, I can try to track down which article it was so we can set the record straight :)

  • alexeiramone

    Very nice indeed, thanks

  • Harry Moreno

    Very nice post. I followed along and got myself a copy of the postgres source.
    Do you happen to know why I couldn’t git clone it? http://git.postgresql.org/gitweb/?p=postgresql.git;a=summary

    Something I was surprised by were the diagrams. I thought b-trees only pointed left and right. But here they are shown also with many pointers. So they act more like buckets and split or merge depending on the number of entries. I must have confused them with binary search trees.

    • Torsten Grust

      Indeed, the “B” in “B-tree” (also) does not stand for “binary”. ;-) The size of a B-tree node typically matches that of a page on disk (8 or 16 kB, say). Implementations try to fit as many (key,pointer) pairs as possible into such a page to achieve a high fan-out. A high fan-out leads to a shallow tree — which translates into less page I/Os when the B-tree is searched for a key.

    • patshaughnessy

      Thanks!

      Not sure about git.postgresql.org – I haven’t tried using this git server before. I just downloaded the source as a tar ball. In case you’re interested, I typed up some instructions for downloading and setting up the Postgres source for debugging on a Mac using LLDB:

      https://gist.github.com/patshaughnessy/70519495343412504686

      This is what I did last year while writing this article.

  • Gray MacGregor

    Excellent! I thoroughly enjoyed this write-up. Thanks for going into such detail. I knew I wasn’t the only one who has wondered this.