What is an LSM Tree?
LSM Tree stands for Log-Structured Merge-Tree. It is a data structure widely used in systems requiring high write throughput, such as many NoSQL databases and embedded storage engines (e.g., LevelDB, RocksDB, Cassandra, InfluxDB).
The core idea behind LSM Trees is to convert random write operations into sequential write operations on disk, thereby significantly improving write performance. Sequential writes are generally much faster than random writes, especially on Hard Disk Drives (HDDs). For Solid State Drives (SSDs), although random write performance is much better than HDDs, sequential writes still offer performance benefits and help extend SSD lifespan.
Core Components of an LSM Tree
An LSM Tree primarily consists of the following components:
MemTable:
- A MemTable is an in-memory data structure, typically an ordered structure (like a SkipList or Red-Black Tree).
- All new write requests (inserts, updates, deletes) first go into the MemTable.
- Since data is written directly to memory, this process is very fast.
- When the MemTable's size reaches a predefined threshold, it is frozen and flushed to disk.
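To make this concrete, here is a minimal MemTable sketch in Python. It uses a sorted list plus a dict instead of a real SkipList, and the class name, byte-size accounting, and 4 MB threshold are illustrative assumptions rather than the behavior of any particular engine.

```python
import bisect

class MemTable:
    """Illustrative in-memory ordered table; a real engine would use a SkipList."""

    def __init__(self, threshold_bytes=4 * 1024 * 1024):
        self._keys = []                  # keys kept in sorted order
        self._data = {}                  # key -> value
        self._size = 0                   # approximate size in bytes
        self.threshold = threshold_bytes

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._data:
            bisect.insort(self._keys, key)       # maintain key order for later flush
        self._size += len(key) + len(value)
        self._data[key] = value

    def get(self, key: bytes):
        return self._data.get(key)

    def is_full(self) -> bool:
        return self._size >= self.threshold

    def items_sorted(self):
        """Yield (key, value) pairs in key order, ready to become an SSTable."""
        for k in self._keys:
            yield k, self._data[k]
```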
Write-Ahead Log (WAL):
- Before or concurrently with writing data to the MemTable, operations are appended to a WAL file.
- The WAL ensures data durability. If the system crashes before data from the MemTable is flushed to disk, the MemTable can be reconstructed by replaying the WAL, preventing data loss.
- WAL files are written sequentially, making these writes very fast as well.
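A sketch of such a WAL follows, assuming a made-up record format of length-prefixed key/value pairs; real logs add sequence numbers, checksums, and batching.

```python
import os
import struct

class WriteAheadLog:
    """Append-only log; each record is (key_len, val_len, key, value)."""

    def __init__(self, path: str):
        self.path = path
        self._f = open(path, "ab")

    def append(self, key: bytes, value: bytes) -> None:
        record = struct.pack(">II", len(key), len(value)) + key + value
        self._f.write(record)
        self._f.flush()
        os.fsync(self._f.fileno())   # make the record durable before acknowledging the write

    @staticmethod
    def replay(path: str):
        """Yield (key, value) pairs so a MemTable can be rebuilt after a crash."""
        with open(path, "rb") as f:
            while True:
                header = f.read(8)
                if len(header) < 8:
                    break
                klen, vlen = struct.unpack(">II", header)
                yield f.read(klen), f.read(vlen)
```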
SSTable (Sorted String Table):
- SSTables are immutable files stored on disk.
- When a MemTable is frozen, its contents are sorted and written sequentially to disk as a new SSTable file.
- Key-value pairs within an SSTable are sorted by key, which allows for efficient lookups during read operations.
- Once written, SSTable files are never modified (immutability). Updates or deletes are handled by writing new records (or "tombstone" markers) in newer SSTables (or the MemTable).
- Over time, multiple SSTable files accumulate on disk, often organized into levels.
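The following sketch writes a frozen MemTable's sorted contents as a toy SSTable and looks keys up again. The length-prefixed format, the sparse-index interval of 16, and the linear-scan lookup are all simplifications for illustration; real formats use blocks, binary search via the index, checksums, and compression.

```python
import struct

def write_sstable(path: str, sorted_items):
    """Write already-sorted (key, value) pairs; return a sparse index {key: file offset}."""
    index = {}
    with open(path, "wb") as f:
        for i, (key, value) in enumerate(sorted_items):
            if i % 16 == 0:                      # remember every 16th key (sparse index)
                index[key] = f.tell()
            f.write(struct.pack(">II", len(key), len(value)) + key + value)
    return index

def sstable_get(path: str, key: bytes):
    """Linear scan for simplicity; a real engine seeks via the index and binary-searches a block."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                return None
            klen, vlen = struct.unpack(">II", header)
            k, v = f.read(klen), f.read(vlen)
            if k == key:
                return v
```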
LSM Tree Operations
1. Write (Put)
- Step 1: Record the write operation (key-value pair) in the WAL (to ensure durability).
- Step 2: Write the key-value pair to the MemTable.
- If the MemTable reaches its size threshold:
- The current MemTable becomes immutable (Immutable MemTable).
- A new MemTable is created for subsequent writes.
- The contents of the Immutable MemTable are asynchronously sorted and flushed to disk, forming a new SSTable file (typically at Level 0).
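Putting the pieces together, the write path above might look like the sketch below, reusing the hypothetical MemTable and WriteAheadLog classes from earlier; the flush itself is left as a stub and filled in under Flush.

```python
class LSMTree:
    def __init__(self, wal_path="wal.log"):
        self.wal = WriteAheadLog(wal_path)
        self.memtable = MemTable()
        self.immutable_memtables = []   # frozen MemTables waiting to be flushed
        self.sstables = []              # SSTable paths, newest first

    def put(self, key: bytes, value: bytes) -> None:
        self.wal.append(key, value)     # Step 1: durability first
        self.memtable.put(key, value)   # Step 2: fast in-memory write
        if self.memtable.is_full():
            # Freeze the full MemTable and start a fresh one for new writes.
            self.immutable_memtables.append(self.memtable)
            self.memtable = MemTable()
            self.schedule_flush()       # asynchronous; see the Flush sketch below

    def schedule_flush(self):
        pass                            # filled in under Flush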
2. Read (Get)
Read operations can be more complex than in a simple B-Tree structure because they may need to query multiple storage locations:
- Step 1: First, search for the specified key in the MemTable. If found, return the corresponding value.
- Step 2: If not found in the MemTable, query the SSTable files on disk, starting from the newest SSTables (Level 0) and proceeding to older SSTables in deeper (higher-numbered) levels.
- Step 3: Since data within SSTables is sorted by key, efficient lookups (e.g., binary search, or locating via a sparse index) can be performed within each SSTable.
- Step 4: Because there might be multiple updates or delete operations for the same key (stored in different SSTables or the MemTable), the read operation must return the most recent valid value. If a "tombstone" marker is found, it indicates the key has been deleted, so the search stops, and "not found" or the deletion status is returned.
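These read steps can be sketched as a method of the hypothetical LSMTree class, assuming self.sstables is kept newest-first and deletes are stored as a sentinel TOMBSTONE value (see Delete below):

```python
TOMBSTONE = b"__tombstone__"   # illustrative sentinel marking a deleted key

def get(self, key: bytes):
    """Method of the hypothetical LSMTree class."""
    # Step 1: active MemTable, then frozen MemTables, newest data first.
    for table in [self.memtable] + list(reversed(self.immutable_memtables)):
        value = table.get(key)
        if value is not None:
            return None if value == TOMBSTONE else value
    # Step 2: SSTables from newest to oldest; the first version found is the most recent.
    for path in self.sstables:
        value = sstable_get(path, key)
        if value is not None:
            return None if value == TOMBSTONE else value
    return None   # key was never written
```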
To speed up reads, Bloom Filters are commonly used. Each SSTable can be associated with a Bloom Filter, which can quickly determine if a key definitely does not exist in that SSTable, thus avoiding unnecessary disk I/O.
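A toy Bloom Filter showing the idea; the bit-array size, hash count, and use of SHA-256 are arbitrary choices here, whereas real engines size the filter from a target false-positive rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely not in the SSTable"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))
```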
3. Delete
- Delete operations do not immediately remove data from MemTables or SSTables.
- Instead, a special marker, called a "tombstone," is inserted into the MemTable, indicating that the key has been deleted.
- This tombstone marker is flushed to SSTables just like regular data.
- The actual data removal (physical deletion) occurs during a subsequent Compaction process.
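In code, a delete is just another write, continuing the hypothetical LSMTree class and TOMBSTONE sentinel from the read-path sketch:

```python
def delete(self, key: bytes) -> None:
    """Method of the hypothetical LSMTree class: no data is removed here."""
    self.wal.append(key, TOMBSTONE)       # log the delete for durability
    self.memtable.put(key, TOMBSTONE)     # tombstone shadows older versions until compaction
```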
4. Flush
- When a MemTable becomes full, its contents are sorted and written to disk as a new SSTable file. This process is called a flush.
- Flush operations are typically asynchronous and do not block new write requests (as new writes go to a newly created MemTable).
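A sketch of an asynchronous flush that fills in the schedule_flush stub from the write-path sketch; a real engine would also record the new file in a manifest and truncate or rotate the WAL once the SSTable is durable.

```python
import threading

def schedule_flush(self):
    """Run the flush in the background so new writes are not blocked."""
    threading.Thread(target=self._flush_one, daemon=True).start()

def _flush_one(self):
    if not self.immutable_memtables:
        return
    frozen = self.immutable_memtables.pop(0)            # oldest frozen MemTable
    path = f"sstable_{len(self.sstables)}.sst"          # illustrative file naming
    write_sstable(path, frozen.items_sorted())          # one sequential write of sorted data
    self.sstables.insert(0, path)                       # newest SSTable goes first
```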
5. Compaction
Over time, a large number of SSTable files can accumulate on disk. This leads to issues such as:
- Read Amplification: Reading a key might require querying multiple SSTable files, degrading read performance.
- Space Amplification: Old or overwritten versions, deleted data, and tombstone markers still occupy disk space until compaction removes them.
The Compaction process is designed to address these problems. It runs periodically in the background and its main tasks include:
- Merging SSTables: Selecting one or more SSTable files and reading their contents.
- Discarding Redundant Data: For the same key, only the most recent version is kept.
- Cleaning Up Deleted Data: If the latest version of a key is a tombstone, and the tombstone is old enough (ensuring all queries that might reference the key have completed), then the key and its tombstone can be physically deleted.
- Generating New SSTables: Writing the merged and cleaned-up data into new SSTable files.
- Deleting Old SSTables: Once the new SSTables are safely written, the old, merged SSTable files can be deleted.
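A simplified merge covering these tasks, under the assumptions that the inputs fit in memory, that newer files shadow older ones, and that tombstones are dropped only when no SSTable outside the compaction could still contain the key; a real compaction streams the inputs with a k-way merge. The read_sstable helper is a hypothetical full-scan counterpart to the earlier SSTable sketch.

```python
import struct

def read_sstable(path: str):
    """Hypothetical helper: yield every (key, value) pair in file order."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                return
            klen, vlen = struct.unpack(">II", header)
            yield f.read(klen), f.read(vlen)

def compact(paths_newest_first, output_path, drop_tombstones=False):
    """Merge several SSTables, keeping only the newest version of each key."""
    merged = {}
    for path in reversed(paths_newest_first):    # visit oldest first so newer entries overwrite
        for key, value in read_sstable(path):
            merged[key] = value
    if drop_tombstones:
        # Only safe when no older SSTable outside this compaction could still hold the key.
        merged = {k: v for k, v in merged.items() if v != TOMBSTONE}
    write_sstable(output_path, sorted(merged.items()))
    # The caller can now delete the old input files.
```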
There are different compaction strategies, for example:
- Size-Tiered Compaction: SSTables are grouped by size; when enough SSTables of similar size accumulate in a tier, they are merged into a single larger SSTable, which then belongs to the next (larger) tier.
- Leveled Compaction: SSTables are organized into multiple levels (Level 0, Level 1, ..., Level N). Level 0 SSTables may have overlapping key ranges. SSTables at Level 1 and higher typically have non-overlapping key ranges within a level. Compaction merges an SSTable from Level i with the SSTables from Level i+1 that have overlapping key ranges.
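For Leveled Compaction, picking the Level i+1 files to merge with is an overlap test on key ranges, as in this sketch; the SSTableMeta record is an assumption standing in for the per-file metadata a real engine keeps in its manifest.

```python
from collections import namedtuple

# Illustrative per-SSTable metadata (path plus smallest and largest key).
SSTableMeta = namedtuple("SSTableMeta", ["path", "min_key", "max_key"])

def pick_overlapping(sst: "SSTableMeta", next_level):
    """Return the Level i+1 SSTables whose key ranges intersect the Level i file's range."""
    return [c for c in next_level
            if not (c.max_key < sst.min_key or c.min_key > sst.max_key)]
```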
LSM Trees and MVCC (Multi-Version Concurrency Control)
LSM Trees naturally lend themselves to implementing Multi-Version Concurrency Control (MVCC). MVCC is a concurrency control method used by database management systems to provide concurrent access to the database, giving each transaction its own snapshot (version) of the data.
How MVCC works with LSM Trees:
Versioning Data:
- When a key is written (inserted or updated), instead of overwriting the old data, a new version of the key-value pair is created. This new version is typically associated with a transaction ID or a timestamp that indicates when it was created.
- In an LSM Tree, this new version (e.g., `(key, value, timestamp/version)`) is written to the MemTable and subsequently flushed to SSTables.
- Deletes are also versioned by writing a tombstone marker with a timestamp/version (e.g., `(key, tombstone, timestamp/version)`).
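One common encoding, sketched below, folds the version (a sequence number here) into the stored key so that plain key ordering keeps all versions of a user key adjacent, newest first; the exact layout is an illustrative assumption, not any specific engine's format.

```python
import struct

MAX_SEQ = 2**64 - 1

def versioned_key(user_key: bytes, seq: int) -> bytes:
    """Encode (key, seq) so entries sort by user key, then newest version first.
    Assumes user keys contain no NUL byte (an illustrative simplification)."""
    return user_key + b"\x00" + struct.pack(">Q", MAX_SEQ - seq)

def put_versioned(memtable, user_key: bytes, value: bytes, seq: int) -> None:
    memtable.put(versioned_key(user_key, seq), value)

def delete_versioned(memtable, user_key: bytes, seq: int) -> None:
    memtable.put(versioned_key(user_key, seq), TOMBSTONE)   # deletes are versioned too
```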
Read Operations (Snapshot Isolation):
- When a transaction reads data, it is provided with a "snapshot" of the database as of a particular point in time (usually the start time of the transaction).
- To read a key, the system searches the MemTable and SSTables (from newest to oldest). It looks for the version of the key whose timestamp/version is the latest one that is still less than or equal to the transaction's snapshot timestamp.
- If the most recent version found within the snapshot's visibility is a tombstone, the key is considered deleted for that transaction.
- This ensures that a transaction sees a consistent view of the data, unaffected by concurrent modifications made by other transactions that commit after the reading transaction's snapshot time.
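Once the candidate versions of a key have been collected (from the MemTable and SSTables), picking the visible one is a small amount of logic. Versions here are assumed to be (sequence_number, value) pairs, with TOMBSTONE the sentinel from the earlier sketches.

```python
def read_at_snapshot(versions, snapshot_seq: int):
    """versions: iterable of (seq, value) for one key. Return the visible value or None."""
    visible = [(seq, value) for seq, value in versions if seq <= snapshot_seq]
    if not visible:
        return None                    # the key did not exist yet at snapshot time
    seq, value = max(visible)          # newest version within the snapshot
    return None if value == TOMBSTONE else value
```

For example, given versions [(1, b"v1"), (5, b"v2"), (9, TOMBSTONE)], a reader at snapshot 6 sees b"v2", while a reader at snapshot 10 sees the key as deleted.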
Garbage Collection of Old Versions:
- Over time, many versions of the same key can accumulate, leading to space amplification.
- The compaction process in LSM Trees plays a crucial role in garbage collecting old, no-longer-visible versions of data.
- During compaction, when merging SSTables, the system can identify and discard versions that are no longer visible to any active or future transactions. For example, if there are multiple versions of a key `k1`: `(k1, v1, t1)`, `(k1, v2, t2)`, `(k1, v3, t3)` where `t1 < t2 < t3`, `v2` has been superseded by `v3`, and `v1` is older than any active transaction's snapshot, then `v1` (and potentially `v2`, if it is also old enough and superseded) can be garbage collected.
- Tombstones also eventually get garbage collected once they are older than any active transaction snapshot and have been "seen" by the relevant processes.
Benefits of MVCC in LSM Trees:
- Non-Blocking Reads: Readers do not block writers, and writers do not block readers. This is a significant advantage for high-concurrency workloads.
- Snapshot Isolation: Transactions operate on a consistent snapshot of the data, simplifying application logic and preventing anomalies like dirty reads.
- Time-Travel Queries (Potentially): Storing multiple versions allows for querying the state of the data as it existed at some point in the past, although this requires careful management of old versions and may not be a primary goal of all LSM-based MVCC implementations.
The append-only nature of writing to MemTables and then creating immutable SSTables fits very well with the MVCC model of creating new versions rather than updating in place. Compaction naturally provides a mechanism for cleaning up old, unneeded versions.
Advantages and Disadvantages of LSM Trees
Advantages
- High Write Throughput: This is the core advantage. By converting random writes into in-memory operations and sequential writes on disk, write speeds are very high.
- Sequential Write Friendly: Beneficial for both HDDs (reduces seek time) and SSDs (reduces write wear).
- Good Compression Ratios: Because data is sorted in SSTables and compaction removes old data, good data compression can be achieved.
- Well-suited for MVCC: As described above, the structure facilitates efficient multi-version concurrency control.
Disadvantages
- Potentially Lower Read Performance: Read operations might need to query the MemTable and multiple SSTable files, especially when data is spread across many SSTables and multiple versions need to be considered. Bloom Filters and compaction help mitigate this.
- Compaction Overhead: The compaction process consumes CPU and disk I/O resources, which can impact system performance. Compaction strategies need careful tuning, especially with MVCC, as it also handles garbage collection of old versions.
- Space Amplification: Before compaction occurs, old versions of data, superseded data, and tombstone markers temporarily consume disk space. MVCC can exacerbate this if old versions are retained for long periods.
- Implementation Complexity: Compared to traditional structures like B-Trees, a full implementation of an LSM Tree (including WAL, multi-level compaction, MVCC, concurrency control, etc.) is more complex.
Use Cases
LSM Trees are well-suited for:
- Write-intensive applications: Such as logging systems, time-series databases, message queues, etc.
- Systems requiring high data ingestion rates.
- Systems where the dataset is much larger than available memory.
- Databases requiring snapshot isolation and high concurrency (often leveraging MVCC).
Many well-known databases and storage systems use an LSM Tree architecture, including:
- Google LevelDB
- Facebook RocksDB (an optimized and enhanced fork of LevelDB)
- Apache Cassandra
- Apache HBase
- InfluxDB
- ScyllaDB
- CockroachDB (originally built on RocksDB, later on Pebble, its own LSM-based engine; MVCC is provided by its transaction layer)
- TiDB (its storage layer, TiKV, is built on RocksDB, with MVCC implemented at a higher layer)
Hopefully, this tutorial helps you understand the basic principles of LSM Trees and their relationship with MVCC!