Technical

CTE and closure tables

featured 1

There is a lot of techniques to store and manipulate tree-like structures in a relational database. Among them, the one that is noteworthy for his simplicity and performance is the Closure Table approach.

The basic idea is to store all the ancestor-descendant relation and their depths. So from this situation:

    id          name        parent
    ----------  ----------  ----------
    1           One         NULL
    2           Two         1
    3           Three       2
    4           Four        3

We add the following table (call the closure):

    parent      child       depth
    ----------  ----------  ----------
    1           1           0
    2           2           0
    3           3           0
    4           4           0
    1           2           1
    2           3           1
    3           4           1
    1           3           2
    2           4           2
    1           4           3

So we easily get any descendant of a given id for any arbitrary depth.

Manipulation

While not trivial, the maintenance of the closure is not rocket science. The insertion goes like this:

# Collect the current ancestor of the parent (note the depth+1)
cursor.execute(
    'SELECT parent, ? as child, depth+1 FROM closure '
    'WHERE child = ?', (new_id, parent_id))

… and insert them

stm = ‘INSERT INTO closure (parent, child, depth) ‘
‘ VALUES (?, ?, ?)’
cursor.executemany(stm, self.cursor.fetchall())
cursor.execute(stm, (last_id, last_id, 0))

 

Where new_id is the id of the new record and parent_id is the parent under which this new record is added. The last line of code insert the line that tells that a given record is at depth zero with respect to itself. This is needed to get the parent-child relation (at depth one) when a child is added with the above code.

Reparenting is a bit more tricky to get right, so basically when a record is moved in the tree, all the descendants (the subtree under the record) must follow accordingly.

# Detach child
cursor.execute(
    'DELETE FROM closure '
    'WHERE child IN (SELECT child FROM closure where parent = ?) '
    'AND parent NOT IN (SELECT child FROM closure WHERE parent = ?)',
    (child_id, child_id)
)

# Collect the anscestors for the whole subtree
self.cursor.execute(
    'SELECT supertree.parent, subtree.child, '
    'supertree.depth + subtree.depth + 1 '
    'FROM closure AS supertree JOIN closure AS subtree '
    'WHERE subtree.parent = ? '
    'AND supertree.child = ?',
    (child_id, new_parent_id)
)

# Insert new relations
values = list(self.cursor)
self.cursor.executemany(
    'INSERT INTO closure (parent, child, depth) values (?, ?, ?)',
    values
)

Where child_id is the id of the record that is moved and parent_id is the id under which the record is moved.

The first query (“Detach child”) means that we delete all the relations that land in the subtree of the record and that come from outside this subtree. So we keep the intra-relations, because the relative depth of any two elements in the subtree is stable.

The second and third queries (“Collect ancestor” and “Insert new relations”) are a generalisation of the insert technique: we generate all the new relations between the new ancestors and the subtree that is moved and we sum all the depths.

The queries here above are a variation of two different solutions and is the one implemented inside Menger.

Common table expressions

CTE is supported by Postresql since the 8.4 release and appeared in SQLite in February 2014.

CTE allows us to define temporary views that can be constructed in a recursive way. This means we can create queries that will travel our hierarchies until an arbitrary depth and do it with no additional table or column. Example:

WITH RECURSIVE test_cte(id, parent, depth) AS (
  SELECT id, parent, 1 FROM test WHERE parent = 42
 UNION ALL
  SELECT test.id, test_cte.parent, depth+1 FROM test
   JOIN test_cte ON (test.parent = test_cte.id)
)
SELECT id FROM test_cte WHERE depth = 3")

Where “test” is the actual table with the basic scheme presented at the top of this post and “test_cte” is the temporary view. As we can see test_cte is defined as the union of a parent record and a subselect on itself (aka the recursive part). The above query returns all the descendant of the given record (id=42) until a depth of 3.

This expression is not easy to understand but it’s the only one you need. A reparenting can be done with a simple

UPDATE TABLE test SET parent = x WHERE id = y

and a deletion of a subtree can be done by swapping the last select statement with a delete.

Performance comparison

To compare the performance of the above techniques, we can use a small benchmark that hopefully comes close to a real use case:

import sqlite3
import time

conn = sqlite3.connect('test.db')
cursor = conn.cursor()

cursor.execute(
    'CREATE TABLE IF NOT EXISTS test ('
    'id integer primary key, '
    'name varchar, '
    'parent integer REFERENCES test(id)'
    ')'
)

cursor.execute(
    'CREATE TABLE IF NOT EXISTS closure ('
    'parent integer REFERENCES test(id), '
    'child integer REFERENCES test(id), '
    'depth integer '
    ')'
)

def insert(levels, pos=0, parent=None):
    nb_items = levels[pos]
    parents = []

    # Create items for current level
    for i in range(nb_items):
        name = "%s-%s" % (parent, i)
        cursor.execute(
            'INSERT INTO test (name, parent) values(?, ?)',
            (name, parent))
        last_id = cursor.lastrowid
        parents.append(last_id)

        # Fill-in closure table
        cursor.execute(
            'SELECT parent, ? as child, depth+1 FROM closure '
            'WHERE child = ?', (last_id, parent))

        stm = 'INSERT INTO closure (parent, child, depth) '\
            ' VALUES (?, ?, ?)'
        cursor.executemany(stm, cursor.fetchall())
        cursor.execute(stm, (last_id, last_id, 0))


    # Recurse on next level
    if pos < len(levels) - 1:
        for parent in parents:
            insert(levels, pos+1, parent)

levels = list(range(2, 8))

t = time.time()
insert(levels)
print(time.time() - t)
conn.commit()

The above code creates a table called “test” and insert 5912 rows. It also create and populate the closure table with 34406 rows. This step takes only 0.03 second for the test table only and around 8 seconds with the closure content. So clearly in write-heavy scenario the closure approach should be excluded. Let’s now benchmark the time to drill inside a tree:

def test_cte():
    cursor.execute(
        "WITH RECURSIVE test_cte(id, parent, depth) AS ("
        "SELECT id, parent, 1 from test WHERE parent = 9 "
        "UNION ALL "
        "SELECT test.id, test_cte.parent, depth+1 FROM test "
        "JOIN test_cte ON (test.parent = test_cte.id) "
        ") "
        "SELECT id FROM test_cte where depth = 3")
    list(cursor)


def test_closure():
    cursor.execute(
        "SELECT child from closure "
        "WHERE parent = 9 AND depth = 3 and depth > 0"
    )
    list(cursor)


NB = 200
t = time.time()
for i in range(NB):
    test_cte()
print(time.time() - t)


t = time.time()
for i in range(NB):
    test_closure()
print(time.time() - t)

So we read all the descendant of a given record over 3 levels. The CTE version runs in 1.16 seconds and the closure version runs in 0.6 seconds. This means in a read-heavy scenario, the closure table should be preferred.

We can do even better if we add an index on the closure table like this

cursor.execute('CREATE INDEX closure_idx on closure (parent, depth)')

It doesn’t impact the insertion time but the read access is even better, the above test taking 0.05 seconds.