Concurrency, Isolation, and MVCC: How Engines Handle Contention

This article is part of the Query Engine Optimization series.

Every production database serves multiple concurrent users. Without concurrency control, simultaneous reads and writes produce corrupted data, inconsistent query results, or both. The question is not whether to control concurrency, but how much control to impose and what performance to sacrifice for it.

The Core Problem

Consider two transactions running simultaneously:

  • Transaction A reads a customer’s balance (currently $500) and subtracts $100.
  • Transaction B reads the same balance ($500) and subtracts $200.

Without concurrency control, both transactions read $500, compute their results independently, and write back. Transaction A writes $400. Transaction B overwrites it with $300. The correct result ($200) is never produced. This is a lost update, and it destroys data integrity.
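The interleaving above can be made concrete with a short, deterministic sketch. Plain Python variables stand in for database reads and writes here; nothing in it is a real database API.

```python
# A deterministic sketch of the lost-update anomaly: both transactions
# read the balance before either writes, so the second write clobbers
# the first. All names are illustrative.

balance = 500

# Step 1: both transactions read the current balance.
read_by_a = balance   # A sees 500
read_by_b = balance   # B sees 500

# Step 2: each computes its new value independently.
new_a = read_by_a - 100   # A intends 400
new_b = read_by_b - 200   # B intends 300

# Step 3: both write back. B's write overwrites A's.
balance = new_a   # balance is now 400
balance = new_b   # balance is now 300 -- A's update is lost

print(balance)          # 300, not the correct 200
assert balance != 200   # the correct result is never produced
```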

Two-Phase Locking (2PL)

The oldest approach is locking. Two-Phase Locking enforces a simple rule: a transaction must acquire all the locks it needs before releasing any of them.

[Figure: Two-Phase Locking versus MVCC, showing how locking blocks readers while MVCC allows concurrent access]

Shared locks allow multiple readers but block writers. Exclusive locks block both readers and writers. When Transaction B tries to write a row that Transaction A holds a shared lock on, Transaction B waits until A releases the lock.

Strengths: Correctness is straightforward. If you hold the lock, no one else can interfere.

Weaknesses: Readers block writers. Writers block readers. Under high concurrency, transactions spend more time waiting for locks than doing useful work. Deadlocks arise when two transactions each hold a lock the other needs. The engine must detect the cycle and abort one transaction.

MySQL/InnoDB uses row-level locking for write operations. SQL Server uses lock escalation (row to page to table) when too many individual locks are held. Both systems also implement MVCC to reduce reader-writer conflicts.
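A minimal sketch of how holding an exclusive lock across the entire read-modify-write fixes the lost update from earlier. Python's `threading.Lock` stands in for a row lock; this illustrates the locking discipline, not how any real engine implements 2PL.

```python
import threading

# Each transaction acquires the row lock before reading and releases it
# only after writing, mimicking 2PL's growing-then-shrinking phases.

balance = 500
row_lock = threading.Lock()  # stands in for an exclusive row lock

def withdraw(amount):
    global balance
    with row_lock:            # acquire before reading...
        current = balance
        balance = current - amount
    # ...and release only after the write is done

t_a = threading.Thread(target=withdraw, args=(100,))
t_b = threading.Thread(target=withdraw, args=(200,))
t_a.start(); t_b.start()
t_a.join(); t_b.join()

print(balance)   # 200 -- both updates applied, regardless of interleaving
```

Whichever thread wins the race, the other waits for the lock and then reads the already-updated balance, so no update is lost.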

MVCC: Readers Never Block

Multi-Version Concurrency Control solves the reader-writer conflict by keeping multiple versions of each row. Writers create new versions instead of overwriting the current one. Readers see the version that was current when their transaction started.

[Figure: MVCC version chain with three versions of the same row; different transactions see different versions]

How it works:

  1. Each transaction gets a snapshot identifier when it starts (typically a transaction ID or timestamp).
  2. When a transaction reads a row, the engine walks the version chain and returns the most recent version that was committed before the transaction’s snapshot.
  3. When a transaction writes a row, it creates a new version. The old version remains available for transactions that started earlier.
  4. A background garbage collection process (PostgreSQL calls it VACUUM) removes old versions that no transaction can see anymore.

The key property: Readers never block and are never blocked. A long-running analytical query sees a consistent snapshot of the entire database as it existed at the moment the query started, even if other transactions commit changes during execution.
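The visibility rule in step 2 can be sketched with a toy version chain. The tuple layout and transaction IDs here are invented for illustration; real engines store versions in heap tuples or undo logs.

```python
# Each version records the transaction ID that committed it.
# A reader returns the newest version committed at or before its snapshot.

# (commit_txn_id, value), oldest first
version_chain = [(10, "v1"), (25, "v2"), (40, "v3")]

def read(version_chain, snapshot_txn_id):
    """Walk the chain newest-first; return the first version
    visible to this snapshot."""
    for commit_id, value in reversed(version_chain):
        if commit_id <= snapshot_txn_id:
            return value
    return None  # row did not exist yet at this snapshot

print(read(version_chain, 30))   # "v2": txn 40's write is invisible
print(read(version_chain, 50))   # "v3": all versions committed before 50
print(read(version_chain, 5))    # None: row created after this snapshot
```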

Weaknesses: Version storage consumes space. PostgreSQL stores old versions in the heap table itself, requiring VACUUM to reclaim space. If VACUUM falls behind, the table bloats and performance degrades. Oracle and MySQL/InnoDB store old versions in a separate undo log, which is cleaner but adds complexity.

PostgreSQL, Oracle, MySQL/InnoDB, SQL Server, CockroachDB, DuckDB, Snowflake, and Dremio all use MVCC. It is the dominant concurrency control mechanism in modern databases.

Isolation Levels

[Figure: Isolation level spectrum from Read Uncommitted (weakest, fastest) to Serializable (strongest, slowest)]

The SQL standard defines four isolation levels that control what anomalies a transaction can observe:

| Level | Prevents | Allows | Performance |
| --- | --- | --- | --- |
| Read Uncommitted | Nothing | Dirty reads, non-repeatable reads, phantoms | Fastest |
| Read Committed | Dirty reads | Non-repeatable reads, phantoms | Fast |
| Repeatable Read | Dirty reads, non-repeatable reads | Phantoms (in some systems) | Moderate |
| Serializable | All anomalies | Nothing | Slowest |

Dirty read: Transaction A sees uncommitted changes from Transaction B. If B rolls back, A has acted on data that never existed.

Non-repeatable read: Transaction A reads a row, Transaction B modifies and commits it, Transaction A reads the same row and gets a different value.

Phantom: Transaction A runs a query with a range condition, Transaction B inserts a new row matching that condition and commits, Transaction A re-runs the query and gets an extra row.

Most production systems default to Read Committed (PostgreSQL, Oracle, SQL Server) or Repeatable Read (MySQL/InnoDB). Serializable provides the strongest guarantees but at the highest cost: either through strict two-phase locking (which reduces concurrency) or serializable snapshot isolation (which detects conflicts and aborts transactions).
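The difference between re-reading the latest commit (Read Committed) and reading from a pinned snapshot (Repeatable Read) can be sketched with a toy committed-value history. The data structures here are invented for illustration, not any engine's storage format.

```python
# commit_id -> committed value
history = {1: 500}
latest_commit = 1

def read_committed():
    return history[latest_commit]        # always the newest commit

def snapshot_read(snapshot):
    visible = [c for c in history if c <= snapshot]
    return history[max(visible)]         # pinned to the snapshot

snapshot = latest_commit                 # transaction A starts here
first = read_committed()                 # A reads 500

latest_commit = 2                        # transaction B commits an update
history[2] = 300

second = read_committed()                # 300: a non-repeatable read
repeat = snapshot_read(snapshot)         # 500: the snapshot shields A

print(first, second, repeat)             # 500 300 500
```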

Optimistic Concurrency Control (OCC)

OCC takes the opposite approach from locking: assume conflicts are rare and do not acquire locks during the transaction. Instead, the transaction reads and writes freely, then checks for conflicts at commit time.

  1. Read phase: The transaction executes all reads and writes in a local workspace.
  2. Validation phase: At commit time, the engine checks whether any data the transaction read was modified by another committed transaction since it started.
  3. Write phase: If validation passes, the changes are written permanently. If not, the transaction is aborted and must retry.
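The three phases can be sketched with a per-row version number. The store layout and function names are invented for illustration; real OCC implementations differ in how they track read sets.

```python
# key -> (version_number, value)
store = {"balance": (0, 500)}

def occ_commit(key, read_version, new_value):
    """Validation + write phase: succeed only if no one committed a
    newer version of the row since we read it."""
    current_version, _ = store[key]
    if current_version != read_version:
        return False                                    # conflict: abort
    store[key] = (read_version + 1, new_value)          # install write
    return True

# Transaction A: read phase
a_version, a_value = store["balance"]                   # reads (0, 500)

# Transaction B reads, validates, and commits first
b_version, b_value = store["balance"]
assert occ_commit("balance", b_version, b_value - 200)  # balance -> 300

# A's validation now fails: the version moved under it
assert not occ_commit("balance", a_version, a_value - 100)

# A retries with fresh data and succeeds
a_version, a_value = store["balance"]                   # reads (1, 300)
assert occ_commit("balance", a_version, a_value - 100)  # balance -> 200

print(store["balance"])   # (2, 200)
```

Note that neither transaction ever waited on a lock; the loser simply redid its work, which is exactly the tradeoff described below.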

Strengths: No lock contention during execution. If conflicts are truly rare, OCC achieves high throughput because transactions never wait.

Weaknesses: If conflicts are frequent, transactions are repeatedly aborted and retried, wasting all the work done before validation. OCC works well when contention is low and transactions are short.

CockroachDB and TiDB use forms of optimistic concurrency control. Google Spanner uses a hybrid approach.

How Lakehouse Table Formats Handle Concurrency

Apache Iceberg, Delta Lake, and Apache Hudi take a fundamentally different approach to concurrency because they operate on immutable files in object storage rather than mutable database pages.

Iceberg’s approach: Writes produce new data files and new metadata (manifest files and a new snapshot recorded in a new table metadata file). The commit is an atomic swap of the catalog’s pointer to that metadata file. Concurrent writers that do not conflict (e.g., inserting into different partitions) both succeed via optimistic concurrency with retry. Conflicting writes (e.g., both deleting the same rows) are detected at commit time, and one writer must retry.

Readers always see a consistent snapshot because they read from a fixed snapshot pointer. There is no locking, no blocking, and no VACUUM needed. Old snapshots and their data files are cleaned up by an explicit expire_snapshots operation.
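The commit protocol can be sketched as a compare-and-swap on the catalog's current-metadata pointer. The `Catalog` class and its methods here are hypothetical stand-ins, not Iceberg's actual API.

```python
import threading

class Catalog:
    """Holds the single mutable pointer to the current metadata file."""
    def __init__(self):
        self._lock = threading.Lock()
        self.current = "metadata-0.json"

    def compare_and_swap(self, expected, new):
        with self._lock:                      # the only atomic step
            if self.current != expected:
                return False                  # someone else committed first
            self.current = new
            return True

def commit(catalog, make_metadata, max_retries=5):
    for _ in range(max_retries):
        base = catalog.current                # read the current snapshot
        new_meta = make_metadata(base)        # write new immutable files
        if catalog.compare_and_swap(base, new_meta):
            return new_meta                   # atomic pointer swap won
        # lost the race: re-check for real conflicts, rebase, retry
    raise RuntimeError("too many concurrent commits")

catalog = Catalog()
print(commit(catalog, lambda base: "metadata-1.json"))

# A writer that raced off the old pointer sees its swap fail and retries:
assert not catalog.compare_and_swap("metadata-0.json", "metadata-2.json")
```

Everything before the pointer swap is side-effect-free from the table's point of view, which is why losing the race costs only a retry, never corruption.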

This model is why lakehouse engines like Dremio, Spark, and Trino can run long analytical queries concurrently with ongoing data ingestion without any interference. The reader sees the snapshot that existed when the query started; the writer creates a new snapshot that future queries will see.

Where Real Systems Land

| System | Primary Mechanism | Default Isolation | Write Conflicts | Garbage Collection |
| --- | --- | --- | --- | --- |
| PostgreSQL | MVCC (heap-stored versions) | Read Committed | Row-level locking | VACUUM (autovacuum) |
| MySQL/InnoDB | MVCC (undo log) + row locks | Repeatable Read | Row-level locking | Purge thread |
| Oracle | MVCC (undo tablespace) | Read Committed | Row-level locking | Automatic undo management |
| CockroachDB | MVCC + OCC | Serializable | Optimistic with retry | GC job |
| DuckDB | MVCC | Snapshot | Single-writer lock | Automatic |
| Snowflake | MVCC (micro-partition versioning) | Read Committed | Automatic conflict detection | Automatic |
| Dremio + Iceberg | Snapshot isolation (immutable files) | Snapshot | Optimistic commit with retry | expire_snapshots |
| Spark + Delta Lake | Optimistic concurrency (transaction log) | Snapshot / Serializable | Conflict detection at commit | VACUUM |

The Fundamental Tradeoff

Every concurrency control mechanism trades throughput for correctness guarantees:

  • Stronger isolation (Serializable, strict locking) prevents more anomalies but reduces the number of transactions that can run concurrently.
  • Weaker isolation (Read Committed, optimistic) allows more concurrent transactions but permits anomalies that application code must handle.
  • MVCC with snapshot isolation provides a pragmatic middle ground: readers never block, writers are serialized on conflicting rows, and the only anomaly permitted (write skew) is rare in most applications.

Most analytical engines (Dremio, Snowflake, BigQuery, DuckDB) default to snapshot isolation because analytical workloads are read-heavy with infrequent writes. The readers-never-block property of MVCC is exactly what long-running analytical queries need.

There is no single best concurrency control strategy. The right choice depends on your ratio of reads to writes, the frequency of conflicts, and how much application complexity you are willing to accept.

