Architecture of a Database System

Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton

DOI: 10.1561/1900000002

Database Management Systems (DBMSs) are a ubiquitous and critical component of modern computing, and the result of decades of research and development in both academia and industry. Historically, DBMSs were among the earliest multi-user server systems to be developed, and thus pioneered many systems design techniques for scalability and reliability now in use in many other contexts. While many of the algorithms and abstractions used by a DBMS are textbook material, there has been relatively sparse coverage in the literature of the systems design issues that make a DBMS work. This paper presents an architectural discussion of DBMS design principles, including process models, parallel architecture, storage system design, transaction system implementation, query processor and optimizer architectures, and typical shared components and utilities. Successful commercial and open-source systems are used as points of reference, particularly when multiple alternative designs have been adopted by different groups.

1. Introduction

Relational Systems: The Life of a Query

Main Components of a DBMS

  1. A Client Communications Manager establishes and remembers the connection state for the caller, responds to SQL commands from the caller, and returns both data and control messages to the caller.
  2. A Process Manager performs admission control, after which it dispatches and schedules a "thread of computation" for the SQL commands received from the Client Communications Manager.
  3. A Relational Query Processor parses a SQL command into a Query Plan which is first authorized, then rewritten, later optimized, and finally executed.
  4. A Transactional Storage Manager manages all data access and manipulation calls from the Relational Query Processor.
    • Access Methods are algorithms and data structures for organizing and accessing data, such as tables and indexes, on disk.
    • A Buffer Manager decides when and what data to transfer between disk and memory buffers.
    • A Lock Manager ensures the correctness of concurrent execution by acquiring locks before data are accessed.
    • A Log Manager ensures that a transaction is durable if committed and fully undone if aborted.
  5. At the end of a query or a transaction, the stack of components consisting of the Client Communication Manager, Process Manager, Relational Query Processor, and Transactional Storage Manager is unwound to return results to the caller and to deallocate allocated resources.
  6. As an aside, a Catalog Manager supports Query Parsing, Authorization, and Query Optimization.
  7. As an aside, a Memory Manager dynamically allocates and deallocates memory throughout the DBMS.

2. Process Models

  • An Operating System Process is a kernel-scheduled, OS program execution unit with a private address space, a set of resource handles, and a security context.
  • An Operating System Thread is a kernel-scheduled, OS program execution unit without a private address space.
  • A Lightweight Thread Package provides application-scheduled threads of execution. Advantageously, a Lightweight Thread switches faster than an Operating System Thread. Disadvantageously, a Lightweight Thread that blocks (e.g., on synchronous I/O) blocks all other threads in the same user-space scheduler.
  • A DBMS Client is the API used to communicate with a DBMS.
  • A DBMS Worker is the thread of execution in a DBMS that does work on behalf of a DBMS Client.

Uniprocessors and Lightweight Threads

  • Assumption #1 - OS Thread Support: Large Number of Kernel Threads + Small Memory Overhead + Inexpensive Context Switches.
  • Assumption #2 - Uniprocessor Hardware: Single Machine with Single CPU.

Process per DBMS Worker

  • Main Concept: A Dispatcher Process forks an Execution Process per DBMS Worker.
  • Advantage #1: The OS provides isolation and scheduling of DBMS Workers.
  • Disadvantage #1: Shared Components must be explicitly allocated in OS-supported shared memory accessible across all DBMS processes.
  • Disadvantage #2: Because an Operating System Process is more expensive than an Operating System Thread, Process per DBMS Worker is more expensive to scale.
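  • For the process-per-worker model above, the sketch below shows one way a shared component could be placed in OS-supported shared memory using POSIX shm_open() and mmap(); the region name and the SharedCounters structure are purely illustrative and not taken from any particular DBMS (concurrent updates would additionally need latches or atomics).

// Hypothetical example: placing a shared structure in POSIX shared memory so
// that all forked DBMS worker processes can read and update it.
// Link with -lrt on some platforms.
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct SharedCounters   // illustrative shared component
{
    long activeWorkers;
    long bufferPoolHits;
} SharedCounters;

int main(void)
{
    // Create (or open) a named shared-memory object visible to all processes.
    int fd = shm_open("/dbms_shared_example", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, sizeof(SharedCounters)) != 0) { perror("ftruncate"); return 1; }

    // Map it; MAP_SHARED makes updates visible across process boundaries.
    SharedCounters *shared = mmap(NULL, sizeof(SharedCounters),
                                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (shared == MAP_FAILED) { perror("mmap"); return 1; }

    memset(shared, 0, sizeof(*shared));
    shared->activeWorkers = 1;      // a forked worker process would update this too
    printf("activeWorkers = %ld\n", shared->activeWorkers);

    munmap(shared, sizeof(*shared));
    close(fd);
    shm_unlink("/dbms_shared_example");
    return 0;
}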

Thread per DBMS Worker

  • Main Concept: A single multi-threaded process consists of a Dispatcher Thread allocating an Execution Thread per DBMS Worker.
  • Advantage #1: Thread per DBMS Worker scales better than Process per DBMS Worker.
  • Disadvantage #1: The OS does not provide any isolation of DBMS Workers.
  • Disadvantage #2: Debugging is difficult due to the shared execution and shared memory between threads.
  • Disadvantage #3: Different OSs may have different Thread APIs.

Process Pool/Thread Pool

  • Main Concept: DBMS Workers are hosted by a bounded pool of processes or threads; each request is dispatched to an available process or thread from the pool, which is returned to the pool when the request completes.
  • Advantage #1: A pool is memory efficient by requiring fewer processes or threads.
  • Advantage #2: A pool can be resized dynamically according to request load.

Shared Data and Process Boundaries

  • Thread-based models: DBMS Workers share data directly through the process's address space.
  • Process-based models: shared data must be explicitly allocated in OS-supported shared memory.
  • In either model, the two principal kinds of shared data are Disk I/O Buffers and Client Communication Buffers, covered below.
  • Database I/O Requests: A Buffer Pool is a shared, in-memory data structure that stages all reads and writes for a DBMS Worker through the use of handles to frames.
  • Log I/O Requests: A Log Tail is a shared, in-memory queue that is periodically flushed by a separate process/thread acting as a Log Manager, yet it is appended by DBMS Workers through the use of inter-process communication or direct memory access.
  • Lock Requests: A Lock Table is a shared, in-memory data structure that implements database locking semantics; it is shared using similar semantics as a Buffer Pool.
  • Client Communication Buffers: A DBMS Worker enqueues tuples onto a socket to communicate results from a Server Process to a Client Process.

DBMS Threads

  • Assumption #1 - OS Thread Support: Small Number of Kernel Threads + Large Memory Overhead + Expensive Context Switches.
  • Assumption #2 - Uniprocessor Hardware: Single Machine with Single CPU.

DBMS Threads

  • A DBMS Thread is a lightweight thread programmed to manage its own state, to perform I/Os via non-blocking, asynchronous interfaces, and to frequently yield control to a dispatcher.
  • Advantage #1: DBMS Threads are more efficient than OS Threads.
  • Advantage #2: DBMS Threads are more portable to OSs with inefficient support for OS Threads.
  • Disadvantage #1: DBMS Threads have significant developer costs related to replicating OS logic.

Standard Practice

Database               Process per DBMS Worker   Thread per DBMS Worker   Process Pool   Thread Pool
MySQL                  No                        Yes                      No             Yes
PostgreSQL             Yes                       No                       No             No
IBM DB2                Yes                       Yes                      Yes            Yes
Oracle                 Yes                       No                       Yes            No
Microsoft SQL Server   No                        Yes                      No             Yes

Admission Control

  • An Admission Control Policy rejects new requests until sufficient DBMS resources are available.
    • Memory-Intensive Query Processing Operators (e.g., Sorting, Hashing) → Memory Pressure → Thrashing.
    • Lock Contention → Transaction Failures → Thrashing.
  • The goal of an Admission Control Policy is graceful degradation: under overload, throughput stays at its peak while transactional latencies increase proportionally to the arrival rate.
  • Tier 1 Admission Control Policy: A Dispatcher Process can restrict the number of DBMS Clients and DBMS Workers.
  • Tier 2 Admission Control Policy: An Execution Admission Controller can estimate the resources that a query will require and the current availability of system resources to postpone or to restrict the query.
    1. Estimate of I/O Load: Disk Devices + Random/Sequential I/O.
    2. Estimate of CPU Load: Query Plan Operators.
    3. Estimate of Memory Footprint: Internal Data Structures, Sorting, Hashing.
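  • The sketch below shows, under purely hypothetical names and policies, how a Tier 2 Execution Admission Controller might gate queries on their estimated memory footprint against a global execution-memory budget; a real controller would also weigh the CPU and I/O estimates listed above.

// Hypothetical sketch of a Tier 2 execution admission controller: admit a
// query only if its estimated sort/hash memory fits within the remaining
// execution-memory budget; otherwise the caller should queue it, not reject it.
#include <stdbool.h>
#include <stddef.h>

typedef struct QueryEstimate
{
    size_t sortHashBytes;       // optimizer's estimated memory footprint
    double cpuCost;             // estimated CPU work over the plan operators
    double ioCost;              // estimated random/sequential I/O
} QueryEstimate;

typedef struct AdmissionController
{
    size_t memoryBudgetBytes;   // total memory reserved for query execution
    size_t memoryInUseBytes;    // memory promised to already-admitted queries
} AdmissionController;

// Returns true and reserves memory if the query may begin execution now.
bool Admission_TryAdmit(AdmissionController *ac, const QueryEstimate *query)
{
    if (ac->memoryInUseBytes + query->sortHashBytes > ac->memoryBudgetBytes)
        return false;           // defer: postpone the query until memory frees up
    ac->memoryInUseBytes += query->sortHashBytes;
    return true;
}

// Called when an admitted query finishes, releasing its reservation.
void Admission_Release(AdmissionController *ac, const QueryEstimate *query)
{
    ac->memoryInUseBytes -= query->sortHashBytes;
}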

3. Parallel Architecture: Processes and Memory Coordination

Shared-Memory

  • A Shared-Memory Parallel System is a system in which all processors and cores can access the same RAM and disk with equal performance.
  • Advantage #1: All of the uniprocessor process models extend naturally to shared-memory hardware, so multiple independent SQL requests execute in parallel with little change.
  • Disadvantage #1: The main challenge is modifying the Relational Query Processor to parallelize a single query across multiple CPUs.

Shared-Nothing

  • A Shared-Nothing Parallel System is a system in which a cluster of independent machines communicate over a high-speed network interconnect.
  • Advantage #1: Commonly, the Shared-Memory Process Models are run unchanged on each independent machine, with the data horizontally partitioned across the machines.
  • Advantage #2: The Query Optimizer decides how to logically partition workloads across each independent machine to minimize communication of intermediate results when satisfying a query; hence, minimizing required coordination.
  • Disadvantage #1: A significant burden is placed on the Database Administrator to choose intelligent partitioning schemes.
  • Disadvantage #2: As a Distributed System, a Shared-Nothing Parallel System suffers from common difficulties and problems such as availability-vs-consistency trade-offs, coordinated transaction completion, distributed deadlock detection, redundancy schemes, and partial failures.

Shared-Disk

  • A Shared-Disk Parallel System is a system in which all processors can access the disks (e.g., Storage Area Networks) with equal performance, but are unable to access each other's RAM.
  • Advantage #1: Shared-Disk Parallel Systems have lower cost of administration.
  • Advantage #2: The failure of a single DBMS processing node does not affect the other nodes' ability to access the entire database.
  • Disadvantage #1: The storage subsystem is a single point of failure.
  • Disadvantage #2: Explicit coordination of data sharing across machines is needed to satisfy queries.
    • Distributed Lock Manager + Cache-Coherency Protocol for Distributed Buffer Pools.

NUMA

  • A Non-Uniform Memory Access (NUMA) System is a system in which a shared-memory programming model is provided over a cluster of systems with independent memories with non-uniform, local and remote memory access times.
  • Advantage #1: A Non-Uniform Memory Access (NUMA) System occupies a middle ground, retaining the simpler shared-memory programming model while scaling beyond a single Shared-Memory Parallel System.
  • Disadvantage #1: Complex optimizations are required to avoid memory access bottlenecks.

DBMS Threads and Multi-processors

  • When DBMS Threads within Multiple Processes are used on multi-processors, DBMS Threads must be distributed and migrated to maximize physical parallelism while minimizing the per-process memory overhead.

Standard Practice

  • Shared-Memory: IBM DB2, Oracle, Microsoft SQL Server.
  • Shared-Nothing: IBM DB2, Informix, Tandem, NCR Teradata, Greenplum.
  • Shared-Disk: Oracle RAC, RDB, IBM DB2 for zSeries.

4. Relational Query Processor

  • A Relational Query Processor validates, optimizes, and executes a SQL statement through its Logical/Physical Query Plan.
  • Generally, Relational Query Processing is a single-user, single-threaded task; Concurrency Control is managed transparently by lower layers of the system.
  • Generally, Data Manipulation Language statements including SELECT, INSERT, UPDATE, and DELETE are processed by a Relational Query Processor.
  • Generally, Data Definition Language statements such as CREATE TABLE and CREATE INDEX are not processed by a Relational Query Processor; they are often implemented as explicit calls to the Storage Engine and the Catalog Manager.

Query Parsing and Authorization

  1. Verify that the query is syntactically correct.
  2. Resolve names and references according to the Catalog Manager.
    1. Canonicalization: The SQL Parser canonicalizes table aliases/names into a fully qualified name of the form server.database.schema.table.
    2. Attribute References: The SQL Parser invokes the Catalog Manager to ensure that attribute references are correct by disambiguating expressions and operators.
  3. Convert the query into a Logical Query Plan for the Query Optimizer.
  4. Verify that the user is authorized to execute the query.
    1. Partial Authorization: The SQL Parser verifies that the user has appropriate DML/DDL permissions on the tables, functions, and other objects.
    2. Deferred Authorization: The SQL Parser cannot enforce all security requirements such as row-level security without executing a Query Plan; thus, Full Authorization is often deferred.

Query Rewrite

  • A Query Rewriter simplifies and normalizes a Logical Query Plan without changing its semantics by relying only on the Catalog Manager.
  • View Expansion: A Query Rewriter recursively replaces all view references with the columns, the tables, and the predicates referenced by the views.
  • Constant Arithmetic Evaluation: A Query Rewriter simplifies constant arithmetic expressions.
  • Logical Rewriting of Predicates: A Query Rewriter logically rewrites predicates to short-circuit execution when a predicate is trivially true or unsatisfiable, to improve the match between predicates and indexes, and to introduce Transitive Predicates beneficial to the Query Optimizer.
    • Transitive Predicate: R.x < 10 AND R.x = S.y ⟹ R.x < 10 AND R.x = S.y AND S.y < 10
  • Semantic Optimization: A Query Rewriter consults the integrity constraints recorded in the Catalog Manager to rewrite a query when those constraints interact with the query's predicates (e.g., eliminating a join that cannot affect the result).
  • Subquery Flattening and Other Heuristic Rewrites: A Query Rewriter flattens subqueries and uses other heuristics to rewrite semantically equivalent Logical Query Plans to maximally expose opportunities for the Query Optimizer.

Query Optimizer

  • A Query Optimizer transforms a Logical Query Plan into an efficient Physical Query Plan for execution.
  • A Logical Query Plan is decomposed into SELECT-FROM-WHERE query blocks which are individually optimized before recomposition.
  • A Physical Query Plan can be represented by the compilation of machine code (System R) or interpretation of physical operators (INGRES): a tradeoff between performance and portability.
  • System R Optimizer: P. G. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, and T. Price, “Access path selection in a relational database management system,” in Proceedings of ACM-SIGMOD International Conference on Management of Data, pp. 22–34, Boston, June 1979.
  • Plan Space: The System R Optimizer restricted its search to left-deep Query Plans and deferred Cartesian products; state-of-the-art optimizers also consider bushy plan shapes and the early use of Cartesian products.
  • Selectivity Estimation: The System R Optimizer relied on naive table and index cardinalities; state-of-the-art systems analyze, sample, and summarize the distributions of attribute values.
  • Search Algorithms: The System R Optimizer used dynamic programming to enumerate JOIN orderings and pick the optimal one; some state-of-the-art systems instead use goal-directed, top-down search algorithms.
  • Parallelism: State-of-the-art systems support intra-query parallelism and generally use a Two-Phase Query Optimizer to schedule sequential and parallel operators.
    1. A Single-System Query Optimizer chooses the best Single-System Physical Query Plan.
    2. The Single-System Physical Query Plan is scheduled across multiple processors or machines.
  • Auto-Tuning: The state-of-the-art supports the ability for a DBMS to make tuning decisions automatically.

A Note on Query Compilation and Recompilation

  • Preparation: A SQL statement can be prepared such that a Physical Query Plan is cached for subsequent executions.
  • Advantage #1: Preparation improves the latency when executing a SQL statement.
  • Disadvantage #1: Selectivity Estimation is performed assuming "typical" parameter values, which may be a poor match for the actual values supplied at execution time.
  • Disadvantage #2: Cached Physical Query Plans must be eventually re-optimized as the database changes over time.

Query Executor

  • A Query Executor executes a Physical Query Plan by interpreting compiled, low-level op-codes (Rare) or invoking procedures for the operators encapsulated in a directed dataflow graph (Common).
  • The Iterator is a model for executing a directed dataflow graph by implementing all operators as subclasses of the classic object-oriented Iterator.
// Abstract Iterator; descendants add their own operator-specific members.
#include <stddef.h>

typedef struct TupleDescriptor TupleDescriptor;  // defined by the tuple layer

typedef struct Iterator
{
    size_t inputsLen;            // number of child iterators
    struct Iterator **inputs;    // children in the directed dataflow graph
} Iterator;

// Allocate Resources.
void Iterator_Initialize(Iterator *iterator);

// Deallocate Resources.
void Iterator_Close(Iterator *iterator);

// Null-Terminated Yield: returns NULL when no more tuples remain.
TupleDescriptor *Iterator_Next(Iterator *iterator);
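
  • As an illustration only (not taken from the paper), the sketch below shows how a Filter operator might specialize the abstract Iterator above. The FilterIterator layout and its predicate callback are hypothetical; real systems dispatch Initialize/Next/Close through per-operator function pointers rather than the fixed functions declared here.

// Illustrative Filter operator built on the abstract Iterator above.
#include <stdbool.h>

typedef struct FilterIterator
{
    Iterator base;                                    // base.inputsLen == 1
    bool (*predicate)(const TupleDescriptor *tuple);  // qualification to apply
} FilterIterator;

// Pull tuples from the single child until one satisfies the predicate;
// propagate NULL when the child is exhausted (null-terminated stream).
TupleDescriptor *FilterIterator_Next(FilterIterator *filter)
{
    for (;;)
    {
        TupleDescriptor *tuple = Iterator_Next(filter->base.inputs[0]);
        if (tuple == NULL || filter->predicate(tuple))
            return tuple;
        // Non-qualifying tuple: discard it and pull again.
    }
}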

Iterator Discussion

  • Advantage #1: Every Iterator's logic is independent of its children and parents in the directed dataflow graph.
  • Advantage #2: Iterators combine the flow of data and the flow of control, so a single Iterator_Next() call on the root of the directed dataflow graph is sufficient; no intermediary queues or rate-matching between Iterators are required.
  • Advantage #3: At least one Iterator in a directed dataflow graph is always active; thus, resource utilization is maximized.
  • Advantage #4: To support parallel query execution, no fundamental changes are required to the Iterator model or query execution architecture. Parallelism and network communications can be encapsulated within Exchange Iterators.

Where's the Data?

  • Generally, an Iterator is pre-allocated a fixed number of Tuple Descriptors, one for each of its inputs and outputs.
  • A Tuple Descriptor is an array of column references, where each column reference is composed of a reference to a tuple somewhere else in memory, and a column offset in that tuple.
  • A BP-Tuple is a tuple that resides in a Buffer Pool page and is referenced while that page's pin count is held.
  • An M-Tuple is a tuple allocated on the heap, used to copy BP-Tuples out of the Buffer Pool and to hold the results of expression evaluation.

Data Modification Statements

  • Batch Read-Then-Write Scheme: Interpose Record-ID Materialization and Fetching Operators between the Read Operators and the Write Operators so that a DML statement cannot see its own modifications (avoiding the Halloween Problem).
    1. The Materialization and Fetching Operator materializes the RIDs of all tuples from the prior Read Operators into a temporary file.
    2. The Materialization and Fetching Operator fetches each physical tuple ID by RID for the later Write Operators.

Access Methods

  • Access Methods are routines that manage access to various disk-based data structures the DBMS supports.
    • e.g., Heap Files, B+-Tree Indexes, Hash Indexes, R-Tree Multi-Dimensional Indexes, RD-Tree Text Indexes, and Bloom Filter Indexes.
  • An Access Method's API is an extension of the Iterator API.

Search Arguments (SARGs)

  • Traditionally, Search Arguments (SARGs) of the form Column Operator Constant are accepted by the Iterator_Initialize() routine.
  • By passing SARGs into the Access Methods layer, the DBMS keeps a clean architectural boundary between the Storage Engine and the Relational Engine while still obtaining excellent performance: non-qualifying tuples are filtered inside the access method, avoiding per-tuple iterator calls and unnecessary page pin/unpin and tuple-copy overhead, and index scans can use matching SARGs directly as lookup keys.
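  • A minimal sketch of one possible SARG representation, assuming conjunctive Column-Operator-Constant terms handed to the access method when a scan is opened; the type and field names are illustrative.

// Hypothetical representation of Search Arguments (SARGs) of the form
// "Column Operator Constant", passed to an access method so that
// qualification happens inside the Storage Engine rather than per tuple above it.
#include <stddef.h>

typedef enum SargOperator { SARG_EQ, SARG_LT, SARG_LE, SARG_GT, SARG_GE } SargOperator;

typedef struct SearchArgument
{
    size_t       columnOffset;   // which column of the row to test
    SargOperator op;             // comparison operator
    const void  *constant;       // pointer to the comparison constant
    size_t       constantLen;    // length of the constant, in bytes
} SearchArgument;

// A scan is opened with an array of SARGs, all of which a returned row must
// satisfy (conjunctive semantics); an index scan can use a matching SARG as its key.
typedef struct ScanDescriptor
{
    const SearchArgument *sargs;
    size_t                sargCount;
} ScanDescriptor;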

Row IDs (RIDs)

  • Traditionally, Row IDs (RIDs) are the physical disk addresses of the rows in base tables that allow indexes to reference rows.
  • By equating RIDs with physical disk addresses, RIDs make row lookups fast but row movement expensive; Row Movement arises when a page fills and rows must be relocated, or when a B+-tree page splits.
  • DB2, Microsoft SQL Server and Oracle handle Row Movement by using some of the following methods: Forwarding Pointers, Avoiding Physical Disk Addresses, Avoiding B+-Trees as Primary Storage, Allowing Rows to Span Multiple Pages, and Unique Row Primary Keys.

Data Warehouses

  • A Data Warehouse is a large historical database for decision support that is loaded with new data on a periodic basis.

Bitmap Indexes

  • For low-cardinality columns, Bitmap Indexes are compact to construct and store, and they support bitmap arithmetic (AND, OR, NOT, counting) that speeds up common analytic queries.
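  • A minimal sketch of that bitmap arithmetic, assuming word-packed bitmaps with one bit per row; the representation and function names are illustrative, and real bitmap indexes typically add compression.

// Word-packed bitmap arithmetic of the kind a Bitmap Index enables for
// low-cardinality columns: intersect two predicate bitmaps and count matches.
#include <stddef.h>
#include <stdint.h>

// Intersect two predicate bitmaps (one bit per row); the result has a 1 bit
// for every row satisfying both predicates.
void Bitmap_And(const uint64_t *a, const uint64_t *b, uint64_t *out, size_t words)
{
    for (size_t i = 0; i < words; i++)
        out[i] = a[i] & b[i];
}

// Count qualifying rows, e.g., to answer a COUNT(*) without touching the table.
size_t Bitmap_Count(const uint64_t *bitmap, size_t words)
{
    size_t count = 0;
    for (size_t i = 0; i < words; i++)
        for (uint64_t w = bitmap[i]; w != 0; w &= w - 1)  // clear lowest set bit
            count++;
    return count;
}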

Fast Load

  • When a Data Warehouse loads a large batch of records, Bulk Loaders stream the records into storage without the full overhead of the Relational Query Processor.
  • When a Data Warehouse must continuously absorb a stream of new records while queries run, MVCC-style techniques avoid update-in-place interference and support historical queries.

Materialized Views

  • A Materialized View is an actual table that corresponds to a logical view expression over the base tables.
  • Advantage #1: A Materialized View improves the performance of analytical queries by avoiding the runtime evaluation of the logical view expression.
  • Disadvantage #1: A Materialized View must be kept up to date as updates are performed.
  • The three aspects to the use of Materialized Views are the following: Selecting Views, Maintaining Freshness, and Supporting Ad-hoc Queries.

OLAP and Ad-hoc Query Support

  • An OLAP Cube is a multi-dimensional materialized view of aggregates that provides high performance for predictable queries.

Optimization of Snowflake Schema Queries

  • A Fact Table is a table of facts, measurements, or metrics.
  • A Dimension Table is a table of attributes that distinguish facts; these attributes are commonly categorical, geographical, nominal, ranged, or temporal.
  • A Star Schema consists of a central Fact Table surrounded by Dimension Tables, each with a 1-N primary-key-foreign-key relationship to the Fact Table.
  • A Snowflake Schema is a multi-level Star Schema with hierarchical Dimension Tables.
  • Data Warehouse queries entail filtering one or more dimensions in a Snowflake Schema on some attributes in these tables, then joining the result to a Fact Table, grouping by some attributes in a Fact Table or a Dimension Table, and then computing an aggregate.

Data Warehousing: Conclusions

  • A special purpose Query Optimizer should focus on aggregate queries over Snowflake Schemas.
  • A Column Store is a DBMS that stores each column of a table separately; Column Stores reduce I/O costs by reading only the referenced columns and by compressing each column more effectively.
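  • The sketch below contrasts a row-oriented layout with a column-oriented layout for a hypothetical sales table, illustrating why an aggregate over a single column touches far less data in a Column Store; the table and names are invented for the example.

// Row store: one record per row, all columns stored together.
#include <stddef.h>
#include <stdint.h>

typedef struct SaleRow { int32_t storeId; int32_t productId; double amount; } SaleRow;

// Column store: one dense array per column of the same table.
typedef struct SalesColumns
{
    int32_t *storeId;
    int32_t *productId;
    double  *amount;
    size_t   rowCount;
} SalesColumns;

// SUM(amount) reads only the amount column in the columnar layout, so far
// fewer bytes cross the disk and memory hierarchy than a full-row scan would move.
double SumAmount_Columnar(const SalesColumns *sales)
{
    double total = 0.0;
    for (size_t i = 0; i < sales->rowCount; i++)
        total += sales->amount[i];
    return total;
}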

Database Extensibility

Abstract Data Types

  • Abstract Data Types can be supported by mapping types to methods in the Catalog Manager such that new types and methods can be registered without drastically affecting the SQL Parser.
  • Abstract Data Types can be efficient by postponing selections until after joins and allowing functional indexes on expressions over ADTs.

Structured Types and XML

  • Structured Types refer to non-relational types such as arrays, sets, trees, and XML.
  • A DBMS can support Structured Types by the following three approaches: Custom DBMS, Structured Types as ADTs, Normalizing Nested Structured Types.
  • Traditionally, a DBMS has poor support for Full-Text Search when compared to a Text Indexing Engine.
  • Most DBMSs contain a separate Text Indexing Engine/Subsystem that bypasses transactional logic.

Additional Extensibility Issues

  • Query Optimizers: Proposals exist for rule-driven optimizer subsystems that generate or modify Query Plans and allow new optimization rules to be registered independently.
  • Remote Data Sources: DBMSs support wrapping remote data sources so that they can be queried like native tables.

Standard Practice

  • All major RDBMSs follow the general architecture pioneered by System R.
  • PostgreSQL has a sophisticated Query Processor with a traditional Cost-Based Query Optimizer.
  • MySQL has a simple Query Processor with a focus on nested-loop joins over indexes.

5. Storage Management

  1. A DBMS can interact directly with disks by using the low-level block-mode device drivers.
  2. A DBMS can interact indirectly with disks by using standard OS file system facilities.

Spatial Control

$$ \text{Sequential Access} \gg \text{Random Access} $$
  • Because a DBMS understands its workload access patterns far better than the OS does, it should exercise full control over the spatial positioning of database blocks on disk.
  • If a DBMS interacts directly with disks by using the Raw Disk Access Interfaces, then it can have the best spatial locality of its data.
  • Advantage #1: Raw Disk Addresses correspond closely to physical proximity of storage locations.
  • Disadvantage #1: As entire disk partitions must be devoted to a DBMS, utilities that need a file system interface are unavailable.
  • Disadvantage #2: As raw disk access interfaces are OS-specific, a DBMS can become more difficult to port.
  • Disadvantage #3: As Virtual Disk Devices intercept calls to Raw Disk Access Interfaces, a DBMS has less control over the spatial positioning of database blocks on disk.
  • Disadvantage #4: Because sequential offsets within a single large file in a modern OS file system exhibit reasonably good physical locality, raw device access offers only a minimal performance gain over a large pre-allocated file.

Temporal Control: Buffering

  • If a DBMS uses an OS file system, OS Buffering can silently postpone or reorder writes, negatively affecting the DBMS.
  • Problem #1: A DBMS cannot guarantee atomic recovery after software or hardware failure without explicitly controlling the timing and ordering of disk writes.
  • Problem #2: Predictive features such as Read-Ahead and Write-Behind provided by an OS file system are oblivious to the higher-level semantics of a DBMS, inhibiting performance optimizations.
  • Problem #3: Double Buffering - a DBMS must perform its own buffering for correctness, so additional OS buffering is redundant; the extra memory copies it causes waste memory and CPU, and copying can become a major bottleneck.
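  • A minimal sketch of taking explicit control of write timing, assuming a POSIX environment: the commit log record is appended with write() and forced to stable storage with fsync() before the commit is acknowledged. The file name and record contents are illustrative; opening with O_DIRECT (where supported) would additionally bypass the OS buffer cache.

// Hypothetical commit-time forced write: the commit must not be acknowledged
// until the log record is durable on the log device.
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int logFd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (logFd < 0) { perror("open"); return 1; }

    const char commitRecord[] = "COMMIT txn 42\n";      // illustrative record
    if (write(logFd, commitRecord, sizeof(commitRecord) - 1) < 0) {
        perror("write"); return 1;
    }

    // Force the record (and everything before it) to stable storage now,
    // rather than whenever the OS decides to write back its buffer cache.
    if (fsync(logFd) != 0) { perror("fsync"); return 1; }

    close(logFd);
    return 0;
}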

Buffer Management

  • A Buffer Pool is an array of Frames, each a region of memory the size of a database disk block (a structural sketch follows this list).
  • A Buffer Pool is associated with a Hash Table:
    1. Current Page Numbers → Frame Table Locations.
    2. Current Page Numbers → Backing Disk Storage Page Locations.
    3. Current Page Numbers → Page Metadata (Dirty Bit + Pin Count).
      • A Dirty Bit indicates whether a page has changed since it was read from disk.
      • A Pin Count records the number of active users of a page; a page with a nonzero Pin Count is ineligible for the Page-Replacement Policy.
  • OS Page-Replacement Policies such as LRU and CLOCK perform poorly for many database access patterns.
  • Traditionally, DBMSs employ enhancements to LRU to account for degenerative database access patterns when the Buffer Pool is full.
  • Advantage #1: Blocks are copied in and out of memory without translation to avoid CPU bottlenecks.
  • Advantage #2: Blocks are fixed-size to avoid memory-management complexities of external fragmentation and compaction.
  • Advantage #3: Blocks can be pinned to improve the performance of the Catalog Manager.
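  • A structural sketch of the Buffer Pool bookkeeping described above, with hypothetical names and sizes: fixed-size frames carry a page image plus its Dirty Bit and Pin Count, while the page-number hash table and the replacement policy are elided.

// Hypothetical Buffer Pool structures: an array of fixed-size frames plus
// per-frame metadata; a real system adds a hash table from page number to
// frame and a replacement policy over unpinned frames.
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 8192            // illustrative database block size

typedef struct Frame
{
    uint64_t pageNumber;          // which disk page currently occupies the frame
    bool     dirty;               // changed since it was read from disk?
    int      pinCount;            // > 0 means ineligible for replacement
    uint8_t  data[PAGE_SIZE];     // the page image itself
} Frame;

typedef struct BufferPool
{
    Frame  *frames;
    size_t  frameCount;
    // hash table mapping page numbers to frames omitted for brevity
} BufferPool;

// Pin before use so the replacement policy cannot evict the page mid-access.
void Frame_Pin(Frame *frame)   { frame->pinCount++; }
void Frame_Unpin(Frame *frame) { frame->pinCount--; }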

Standard Practice

  1. A DBA creates a file system on each logical volume used by the DBMS.
  2. The DBMS allocates a single large file in each of those file systems and controls the placement of data within that file with low-level interfaces such as the mmap suite.
  3. The DBMS treats each logical volume as a linear array of contiguous database pages.

6. Transactions: Concurrency Control and Recovery

  • The Transactional Storage Manager encompasses the following components.
    1. A Lock Manager for concurrency control.
    2. A Log Manager for recovery.
    3. A Buffer Pool for staging database I/Os.
    4. Access Methods for organizing data on disk.
  • ARIES: C. Mohan, D. J. Haderle, B. G. Lindsay, H. Pirahesh, and P. M. Schwarz, “Aries: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging,” ACM Transactions on Database Systems (TODS), vol. 17, pp. 94–162, 1992.
  • Transaction Index Concurrency: M. Kornacker, C. Mohan, and J. M. Hellerstein, “Concurrency and recovery in generalized search trees,” in Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 62–72, Tucson, AZ, May 1997.

A Note on ACID

  • Atomicity: Either all operations of the transactions are reflected properly in the database, or none are.
    • Guaranteed by Locking + Logging.
  • Consistency: Execution of a transaction in isolation preserves the consistency of the database.
    • Guaranteed by Runtime Checks in Query Executor.
  • Isolation: Even though multiple transactions may execute concurrently, the system guarantees that, for every pair of transactions $T_i$ and $T_j$, it appears to $T_i$ that either $T_j$ finished execution before $T_i$ started or $T_j$ started execution after $T_i$ finished. Thus, each transaction is unaware of other transactions executing concurrently in the system.
    • Guaranteed by Locking + Logging.
  • Durability: After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.
    • Guaranteed by Logging + Recovery.

A Brief Review of Serializability

  • Serializability guarantees that a sequence of interleaved actions for multiple committing transactions must correspond to some serial execution of the transactions.
  1. Strict Two-Phase Locking (2PL): Transactions acquire a Shared Lock per Read and an Exclusive Lock per Write during a Growing Phase, and all locks are released together at commit or abort time (the Shrinking Phase).
  2. Multi-Version Concurrency Control (MVCC): Transactions do not hold locks, but instead are guaranteed a consistent view of the database state at some time in the past, even if rows have changed since that fixed point in time.
  3. Optimistic Concurrency Control (OCC): Multiple transactions are allowed to read and update items without blocking. Instead, each transaction maintains a history of its reads and writes and, before committing, checks that history for isolation conflicts that may have occurred; if any are found, one of the conflicting transactions is rolled back.

Locking and Latching

  • A Lock lets a DBMS record and check which physical/logical items (e.g., Disk Pages, Tuples, Tables, Indexes) are held, in which lock modes, and by which transaction IDs.
  • A Lock Manager supports the following API.
    • lock(lockName, transactionId, mode)
    • unlock(lockName, transactionId)
    • upgrade(lockName, transactionId, newMode)
    • conditionalLock(lockName, transactionId, mode)
    • removeTransaction(transactionId)
  • A Lock Manager uses a Global Lock Table, a Transaction Table, and a Deadlock Detector.
  • A Global Lock Table is a dynamic hash table of the following.
    • Lock Names → Lock Modes.
    • Lock Names → Wait Queues of (Transaction ID, Lock Mode).
  • A Transaction Table is a dynamic hash table of the following.
    • Transaction IDs → Transaction's Thread States.
      • Facilitates Rescheduling.
    • Transaction IDs → Pointer List to Transaction's Lock Requests.
      • Facilitates Removal.
  • A Deadlock Detector is a DBMS thread that periodically examines the Lock Table to detect waits-for-cycles and to abort deadlocked transactions.
  • A Latch allows a DBMS to provide mutual exclusion for internal data structures.
    • latch(object, mode)
    • unlatch(object, mode)
    • conditionalLatch(object, mode)
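  • The sketch below renders the Lock Manager interface listed above as C declarations; the LockName, LockMode, and TransactionId types are illustrative, not any particular system's actual API.

// Hypothetical C rendering of the Lock Manager interface.
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t TransactionId;
typedef struct LockName { uint32_t objectKind; uint64_t objectId; } LockName;
typedef enum LockMode { LOCK_SHARED, LOCK_EXCLUSIVE } LockMode;

// Block until the lock is granted (or until deadlock detection aborts the transaction).
bool Lock(LockName name, TransactionId txn, LockMode mode);

// Release one lock held by the transaction.
void Unlock(LockName name, TransactionId txn);

// Upgrade an already-held lock, e.g., shared to exclusive.
bool Upgrade(LockName name, TransactionId txn, LockMode newMode);

// Non-blocking variant: returns false immediately if the lock cannot be granted.
bool ConditionalLock(LockName name, TransactionId txn, LockMode mode);

// Release every lock held by the transaction, at commit or abort.
void RemoveTransaction(TransactionId txn);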

Transaction Isolation Anomalies

  • Phantom Read: Re-executing a query within one transaction returns a different set of rows because another transaction inserted or deleted rows matching the query's predicate.
  • Non-Repeatable Read: Repeated reads of the same record in one transaction return different values because of an update made by another transaction.
  • Dirty Read: One transaction reads a value written by another transaction that has not yet committed.

Transaction Isolation Levels

ANSI SQL Isolation Levels

Isolation Level     Phantom Reads   Non-Repeatable Reads   Dirty Reads
Read Uncommitted    Yes             Yes                    Yes
Read Committed      Yes             Yes                    No
Repeatable Read     Yes             No                     No
Serializable        No              No                     No

  1. READ UNCOMMITTED: A transaction may read any version of data, committed or not.
    • This is achieved by read requests proceeding without acquiring any locks.
  2. READ COMMITTED: A transaction may read any committed version of data. Repeated reads of an object may result in different committed versions.
    • This is achieved by read requests acquiring a read lock before accessing an object, and unlocking it immediately after access.
  3. REPEATABLE READ: A transaction will read only one version of committed data; once the transaction reads an object, it will always read the same version of that object.
    • This is achieved by read requests acquiring a read lock before accessing an object, and holding the lock until end-of-transaction.
  4. SERIALIZABLE: Full serializable access is guaranteed.
Other Isolation Levels
  • CURSOR STABILITY: A transaction isolation level above READ COMMITTED that allows a transaction to do read-think-write sequences on individual items without intervening updates from other transactions.
  • SNAPSHOT ISOLATION (MVCC): A transaction isolation level above REPEATABLE READ that allows a transaction to operate on a version of the database as it existed at the time the transaction began.
  • READ CONSISTENCY (MVCC): A transaction isolation level similar to REPEATABLE READ that allows each statement in a transaction to see the most recently committed values as of the start of the statement.

Log Manager

  • A Log Manager is responsible for maintaining the durability of committed transactions, for facilitating the rollback of aborted transactions to ensure atomicity, and for recovering from system failures and non-orderly shutdowns. A sketch of a physical Log Record and the commit-time flush rule follows this list.
  • Traditionally, Database Recovery uses a Write-Ahead Logging (WAL) protocol.
    1. Each modification to a database page should generate a Log Record, and the Log Record must be flushed to the log device before the database page is flushed.
      • Ensures Reversibility of Aborted Transactions.
    2. Database Log Records must be flushed in order; Log Record $R$ cannot be flushed until all Log Records preceding $R$ are flushed.
      • Ensures Durability of Committed Transactions.
    3. Upon a transaction commit request, a Commit Log Record must be flushed to the log device before the commit request returns successfully.
      • Ensures Durability of Committed Transactions.
  • Challenge #1: Fast Commits.
  • Challenge #2: Fast Rollbacks.
  • Challenge #3: Fast Recovery.
  • Traditionally, DBMSs operate in DIRECT, STEAL/NOT-FORCE mode, favoring Fast Commits at the expense of Fast Rollbacks and Fast Recovery.
    1. Data objects are updated in place.
    2. Unpinned buffer pool frames can be stolen even if they contain uncommitted data.
    3. Buffer pool pages need not be forced (flushed) to the database before a commit request returns to the user.
  • Traditionally, DBMSs prefer Physical Logging over Logical Logging.
  • Crash Recovery is a process that restores the database to a consistent state after a system failure or non-orderly shutdown.
    1. Start at the checkpointed recovery Log Sequence Number (LSN): the older of (a) the LSN of the Log Record describing the earliest change to the oldest dirty page in the Buffer Pool and (b) the LSN of the Log Record marking the start of the oldest transaction in the system.
    2. Redo Pass (Committed Transactions): Using a Log Record, set the specified item to the new value.
    3. Undo Pass (Aborted/Incomplete Transactions): Using a Log Record, set the specified item to the old value.
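  • A minimal sketch, under a hypothetical record layout, of a physical Log Record carrying before- and after-images (for undo and redo) and of a commit path enforcing rule (3) above: flush the log up to the commit record's LSN before acknowledging the commit.

// Hypothetical physical log record and commit-time flush rule.
#include <stdint.h>

typedef struct LogRecord
{
    uint64_t lsn;              // log sequence number, monotonically increasing
    uint64_t transactionId;
    uint64_t pageNumber;       // page the update touched
    uint16_t offset;           // byte offset of the change within the page
    uint16_t length;           // number of bytes changed
    // followed by `length` bytes of before-image, then `length` bytes of after-image
} LogRecord;

// Stub standing in for the log device layer: durably flush all log records
// with LSN <= lsn (e.g., write() of the log tail followed by fsync()).
static void Log_FlushUpTo(uint64_t lsn) { (void)lsn; }

// Commit path: once the transaction's commit LogRecord has been appended to the
// in-memory log tail at commitLsn, flush the log up to that LSN before the
// commit request is allowed to return successfully.
void Transaction_Commit(uint64_t commitLsn)
{
    Log_FlushUpTo(commitLsn);
    // only now may the caller be told the transaction is committed
}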

Locking and Logging in Indexes

  • Indexes are physical storage structures that provide alternative access paths to the data; because users cannot observe index entries directly, index concurrency and recovery can be handled more efficiently than for base tables, as long as the indexes return transactionally consistent tuples.

Latching in B+-Trees

  • Conservative Schemes: Multiple transactions wanting to access the same pages are allowed only if they can be guaranteed not to conflict in their use of a page's content.
  • Latch-Coupling Schemes (Latch Crabbing): The tree traversal logic latches each node before it is visited, releasing the latch on a node only after the next node to be visited has been successfully latched.
  • Right-Link Schemes (B-Link Tree): Simple additional structures are added to the B+-tree to minimize the requirement for latches and re-traversals.

Logging for Physical Structures

  • Many changes to physical structures do not need to be undone when a transaction is aborted, so Redo-Only Log Records or Nested Top Actions (Skipping) can be employed to improve the performance of Crash Recovery.

Next-Key Locking: Physical Surrogates for Logical Properties

  • Main Challenge: Provide full serializability while allowing for tuple-level locks and the use of indexes.
  • Problem: Phantoms can arise when a transaction accesses tuples via an index, and another transaction inserts new tuples after the accessed tuples.
  • Solution: Next-Key Locking requires that the insertion of a tuple with index key $K$ acquire an exclusive lock on the next-key tuple in the index, i.e., the existing tuple with the lowest key greater than $K$. Next-Key Locking likewise requires read transactions to take a shared lock on the next-key tuple in the index.
  • Key Insight: A physical object (i.e., Tuple) becomes a surrogate for a logical object (i.e., Predicate).

Interdependencies of Transactional Storage

$$ \text{Concurrency Control} \Leftrightarrow \text{Recovery Management} \Leftrightarrow \text{Access Methods} $$
  • Recovery Management's WALs make implicit assumptions about Concurrency Control's Locking Protocol.
  • Access Methods are difficult to implement abstractly and correctly with varying Concurrency Control and Recovery Management.

Standard Practice

  • Traditionally, DBMSs support ACID transactions while using Two-Phase Locking for Concurrency Control and Write-Ahead Logging for Durability.
  • PostgreSQL supports Multi-Version Concurrency Control exclusively, while Oracle supports it partially.
  • MySQL supports various Storage Managers such as MyISAM and InnoDB.

7. Shared Components

Catalog Manager

  • A Catalog holds metadata about the users, schemas, tables, columns, and indexes within a DBMS.
  • Traditionally, a Catalog is stored as a set of tables in the DBMS itself, so that users can employ the same languages and tools used for regular data.
  • Traditionally, a Catalog is partially denormalized into a network of in-memory data structures for efficiency.

Memory Allocator

  • A Memory Context is an in-memory data structure that maintains a list of regions of contiguous virtual memory (Memory Pools).
  • A Memory Context's region can have a small header that contains either a context label or a pointer to the context header structure.
  • A Memory Context has the following API.
    1. Create a Context with a Given Name/Type: The type might advise the allocator how to efficiently handle memory.
    2. Allocate a Chunk of Memory within a Context: This allocation will return a pointer to memory.
    3. Delete a Chunk of Memory within a Context: This may or may not cause the context to delete the corresponding region.
    4. Delete a Context: This first frees all of the regions associated with the context and then deletes the context header.
    5. Reset a Context: This retains the context, but returns it to the state of original creation, typically by deallocating all previously allocated regions of memory.
  • Advantage #1: Memory Contexts are a low-level programmer-controllable alternative to garbage collection.
  • Advantage #2: Memory Contexts limit traditional memory leaks: deleting or resetting a context reclaims every chunk allocated within it, including chunks the programmer neglected to free individually.
  • Advantage #3: Memory Contexts provide spatial control and temporal control to the programmer.
  • Advantage #4: Memory Contexts are more efficient on platforms in which the overhead for malloc() and free() is relatively high.
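  • The sketch below renders the Memory Context operations listed above as C declarations; the names are illustrative (PostgreSQL's actual memory-context API, for example, differs in its details). A query operator might, for instance, allocate all of its hash-table chunks in one context and reset that context when the query completes.

// Hypothetical C rendering of the Memory Context API.
#include <stddef.h>

typedef struct MemoryContext MemoryContext;   // opaque: header plus a list of regions

// 1. Create a context with a given name/type hint for the allocator.
MemoryContext *MemoryContext_Create(const char *name);

// 2. Allocate a chunk within a context, growing its region list if needed.
void *MemoryContext_Alloc(MemoryContext *context, size_t bytes);

// 3. Free a chunk; the context may or may not return its region to the OS.
void MemoryContext_Free(MemoryContext *context, void *chunk);

// 4. Delete the context: frees every region, then the context header itself.
void MemoryContext_Delete(MemoryContext *context);

// 5. Reset the context to its freshly created state, keeping the header.
void MemoryContext_Reset(MemoryContext *context);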

A Note on Memory Allocation for Query Operators

  • Some DBMSs allow DBAs to specify the amount of RAM for Query Operators.
  • Some DBMSs manage allocations automatically without the intervention of DBAs.

Disk Management Subsystems

  • A Disk Management Subsystem manages the allocations of tables and other units of storage across physical volumes, logical volumes, or OS files.
  • A Disk Management Subsystem is responsible for mapping tables to devices and files, and for handling issues such as database files that are larger than a single disk, OS limits on the number of files and file descriptors, and OS limits on maximum file size.
  • A Disk Management Subsystem has to manage a DBMS's interactions with RAID systems and SAN systems.

Replication Services

  1. Physical Replication: This scheme physically duplicates the entire database; however, it is difficult for this scheme to guarantee a transactionally consistent snapshot of the entire database.
  2. Trigger-Based Replication: This scheme has triggers on tables to capture and to replay DML statements from the primary database to a secondary database; however, it is difficult for this scheme to guarantee a transactionally consistent snapshot of the entire database.
  3. Log-Based Replication: This scheme has a Log Sniffer process to capture and to replay modifications of the Write-Ahead-Log; this scheme provides a low-overhead mechanism to guarantee a transactionally consistent snapshot of the entire database.

Administration, Monitoring, and Utilities

  • Optimizer Statistics Gathering: DBMSs allow DBAs to scan tables to build optimizer statistics.
  • Physical Reorganization and Index Construction: DBMSs allow DBAs to request that tables be reorganized, reclustered, or repartitioned, and that new indexes be constructed.
  • Backup/Export: DBMSs allow DBAs to physically backup and export a database to and from secondary storage.
  • Bulk Load: DBMSs allow DBAs to use an optimized utility to import large volumes of data.
  • Monitoring, Tuning, and Resource Governors: DBMSs allow DBAs to identify and to prevent SQL statements that consume more resources than desirable.