ECE 454 - Distributed Computing

Introduction

Distributed System

  • Distributed System: A collection of autonomous computing elements that appears to its users as a single coherent system.

Motivations for Distributed Systems

  1. Resource Sharing
  2. Simplify Processes by Integrating Multiple Systems
  3. Limitations in Centralized Systems: Weak/Unreliable
  4. Distributed/Mobile Users

Goals for Distributed Systems

  1. Resource Sharing
    • CPUs, Data, Peripherals, Storage.
  2. Transparency
    • Access, Location, Migration, Relocation, Replication, Concurrency, Failure.
  3. Open
    • Interoperability, Composability, Extensibility.
  4. Scalable
    • Size, Geography, Administration.

Types of Distributed Systems

  • Web Services
  • High Performance Computing, Cluster Computing, Cloud Computing, Grid Computing
  • Transaction Processing
  • Enterprise Application Integration
  • Internet of Things, Sensor Networks

Middleware

  • Middleware: A layer of software that separates applications from the underlying platforms.
    • Supports Heterogeneous Computers/Networks.
    • e.g.: Communication, Transactions, Service Composition, Reliability.
    • Single-System View

Scaling Techniques

  1. Hiding Communication Latencies: At Server vs. At Client?
  2. Partitioning
  3. Replication

Fallacies of Networked and Distributed Computing

  1. Network is reliable.
  2. Network is secure.
  3. Network is homogeneous.
  4. Topology is static.
  5. Latency is zero.
  6. Bandwidth is infinite.
  7. Transport cost is zero.
  8. There is only one administrator.

Shared Memory vs. Message Passing

  • Shared Memory:
    • Less Scalable
    • Faster
    • CPU-Intensive Problems
    • Parallel Computing
  • Message Passing:
    • More Scalable
    • Slower
    • Resource Sharing / Coordination Problems
    • Distributed Computing
  • Apache Hadoop is an example of a hybrid computing framework: message passing is used across the cluster (the broad view), while shared memory is used within each node (the detailed view).

Cloud and Grid Computing

  • IaaS: Infrastructure as a Service
    • VM Computation, Block File Storage
  • PaaS: Platform as a Service
    • Software Frameworks, Databases
  • SaaS: Software as a Service
    • Web Services, Business Apps

Transaction Processing Systems

  • Transaction Processing Monitor: Coordinates Distributed Transactions

Architectures

Definitions

  • Component: A modular unit with well-defined interfaces.
  • Connector: A mechanism that mediates communication, coordination, or cooperation among components.
  • Software Architecture: Organization of software components.
  • System Architecture: Instantiation of software architecture in which software components are placed on real machines.
  • Autonomic System: Adapts to its environment by monitoring its own behavior and reacting accordingly.

Architectural Styles

  • Layered
    • Note: Assignment Topic
  • Object-Based
  • Data-Centered
  • Event-Based

Layered Architecture

Layers

  • Examples:
    • Database Server, Application Server, Client
    • SSH Server, SSH Client
  • Requests Flow Down Stack
  • Responses Flow Up Stack
  • Handle-Upcall: Async Notification
    • Subscribe with Handle
    • Publish with Upcall

Client-Server Interactions


  • Bolded Lines = Busy
  • Dashed Lines = Idle
  • Client: Initiates with a Request
  • Server: Follows with a Response
  • Total Round-Trip Time: $(N - 1) \times t_{\text{Request-Response}}$
    • Layering can reduce the amount of processing time per layer, but the additional communication overhead between the layers introduces diminishing returns.
  • An intermediate layer can be both a client and a server to the others.

Multi-Tiered Architecture

  • Logical Software Layers $\mapsto$ Physical Tiers
    • Trade-Offs: Ease of Maintenance vs. Reliability

Horizontal vs. Vertical Distribution

  • Vertical Distribution: When the logical layers of a system are organized as separate physical tiers.
    • Performance: High.
    • Scalability: Low.
    • Dependability: Low-Medium.
  • Horizontal Distribution: When one logical layer is split across multiple machines - sharding.
    • Performance: Low.
    • Scalability: High.
    • Dependability: Medium-High.

Object-Based Architecture

  • In an object-based architecture, components communicate using remote object references and method calls.

Problems with Object-Based Architecture

  • Complex Communication Interfaces
  • Complex Communication Costs
  • Not Scalable
  • Not Language Agnostic

Data-Centered Architecture

  • In a data-centered architecture, components communicate by accessing a shared data repository.

Event-Based Architecture

Publish/Subscribe Middleware

  • In an event-based architecture, components communicate by propagating events using a publish/subscribe system.

Handling Asynchronous Delivery Failure

  • At-Least Once Delivery: Do Retransmit
  • At-Most Once Delivery: Do Not Retransmit
  • Exactly Once Delivery: Unknown/Unachievable

Peer-to-Peer Systems

Chord's Finger Table

  • In a peer-to-peer system, decentralized processes are organized in an overlay network that defines a set of communication channels.
  • In a peer-to-peer distributed hash table, the keyspace is represented as a consistent hash ring, and the nodes partition its ranges amongst themselves.
  • The mapping of partition ranges to nodes is maintained in finger tables, which allow lookups to complete in a logarithmic number of hops (see the sketch below).
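
As an illustration of the hash-ring lookup, here is a minimal consistent-hashing sketch in Python. It is a toy, not Chord itself: it uses a sorted list of node positions and binary search instead of per-node finger tables, and the node names are made up.

import bisect
import hashlib

def ring_hash(name: str) -> int:
    """Place a key or node name on the hash ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Each node is placed on the ring at the hash of its name.
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def lookup(self, key: str) -> str:
        """Return the node responsible for key: the first node at or after its hash."""
        pos = bisect.bisect(self.ring, (ring_hash(key), ""))
        return self.ring[pos % len(self.ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.lookup("some-object"))  # prints whichever node owns this key's range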

Hybrid Architectures

  • BitTorrent is an example of a hybrid architecture combining a client-server architecture and a peer-to-peer architecture.

Self-Management

Self-Management Systems

  • In self-management, systems use a feedback control loop that monitors system behaviors and adjusts system operations.
  • Assignment Note: Useful for Unknown Assignment

Processes

IPC

  • Inter-Process Communication (IPC): Expensive b/c Context Switching

Threads

  • Typically, an operating system kernel supports multi-threading through lightweight processes (LWP).
  • Assignment Note: Do Not Spawn Too Many Threads

Multi-Threaded Servers

  • Dispatcher/Worker Design: A dispatcher thread receives requests from the network and feeds them to a pool of worker threads.
  • Assignment Note: Useful for Assignment 1 & Partition into Sequential Work and Parallel Work
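
A minimal sketch of the dispatcher/worker design in Python, using the standard library's socket and thread-pool modules; the port number and the toy request handler are arbitrary placeholders.

import socket
from concurrent.futures import ThreadPoolExecutor

def handle(conn):
    """Worker: process a single request on an accepted connection."""
    with conn:
        data = conn.recv(1024)       # read the request
        conn.sendall(data.upper())   # toy response: echo the request in upper case

def dispatcher(host="127.0.0.1", port=9090, workers=8):
    """Dispatcher: accept connections sequentially and feed them to the worker pool."""
    with socket.socket() as srv, ThreadPoolExecutor(max_workers=workers) as pool:
        srv.bind((host, port))
        srv.listen()
        while True:
            conn, _ = srv.accept()
            pool.submit(handle, conn)   # requests are processed in parallel by workers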

Hardware and Software Interfaces


Virtualization

VMs

  • Advantage:
    • Portability
    • Live Migration of VMs
    • Replication for Availability/Fault Tolerance
  • Disadvantage:
    • Performance

Server Clusters

Three Physical Tiers

  • Assignment Note: Useful for Assignment 2

Communication

Layered Network Model


Remote Procedure Calls

  • Remote Procedure Calls: A transient communication abstraction implemented using a client-server protocol.
  • Client Stub: Translates an RPC on the client side.
  • Server Stub: Translates an RPC on the server side.

Steps of an RPC


  1. The client process invokes the client stub using an ordinary procedure call.
  2. The client stub builds a message and passes it to the client's OS.
  3. The client's OS sends the message to the server's OS.
  4. The server's OS delivers the message to the server stub.
  5. The server stub unpacks the parameters and invokes the appropriate service handler in the server process.
  6. The service handler does the work and returns the result to the server stub.
  7. The server stub packs the result into a message and passes it to the server's OS.
  8. The server's OS sends the message to the client's OS.
  9. The client's OS delivers the message to the client stub.
  10. The client stub unpacks the result and returns it to the client process.
  • Parameter Marshalling: Packing Parameter $\to$ Message
    • Processor Architectures, Network Protocols, and VMs $\implies$ Little-Endian vs. Big-Endian
  • Number of System Calls: 4
    1. Client Process $\to$ Client OS Socket
    2. Server OS Socket $\to$ Server Process
    3. Server Process $\to$ Server OS Socket
    4. Client OS Socket $\to$ Client Process
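
To make the stub and marshalling steps concrete, here is a small sketch using Python's standard-library XML-RPC modules (the procedure name and port are arbitrary). The library generates the client and server stubs and handles parameter marshalling; run the two halves in separate processes.

# Server process: register a service handler behind the server stub.
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    return a + b

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(add, "add")
# server.serve_forever()   # uncomment to run the server

# Client process: the proxy object plays the role of the client stub.
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
# proxy.add(2, 3)          # marshals the arguments, sends the request, unmarshals the reply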

Defining RPC Interfaces

  • Interface Definition Language (IDL): Specify RPC Signatures $\to$ Client/Server Stubs
    • High-Level Format
    • Parameter Ordering
    • Byte Sizes

Synchronous vs. Asynchronous RPCs

  • Synchronous RPC: The client blocks to wait for the return value.
  • Asynchronous RPC: The client blocks only until the server acknowledges receipt of the request.
  • One-Way RPC: The client does not block to wait.

Message Queuing Model

Message Queue Interface

  • Message Queue: Alternative to RPCs
  • Persistent Communication: Loose Coupling between Client/Server
    • Advantage: Resilient to Client/Server Hardware Failure
    • Disadvantage: Guaranteed Delivery = Impossible
  • Message-Oriented Middleware (MOM): Asynchronous Message Passing

Process Coupling

  • Referential Coupling: When one process explicitly references another.
    • Positive Example: RPC client connects to server using an IP address and a port number
    • Negative Example: Publisher inserts a news item into a pub-sub system without knowing which subscriber will read it.
  • Temporal Coupling: Communicating processes must both be up and running.
    • Positive Example: A client cannot execute an RPC if the server is down.
    • Negative Example: A producer appends a job to a message queue today, and a consumer extracts the job tomorrow.

RPC vs. MOM

RPC

  • Used mostly for two-way communication, particularly where the client requires immediate response from the server.
  • The middleware is linked into the client and the server processes.
  • Tighter coupling means that server failure can prevent client from making progress.

MOM

  • Used mostly for one-way communication where one party does not require an immediate response from another.
  • The middleware is a separate component between the sender/publisher/producer and the receiver/subscriber/consumer.
  • Looser coupling isolates one process from another which contributes to flexibility and scalability.

Distributed File Systems

Accessing Remote Files

DFS Models

  • Remote Access Model
  • Upload/Download Model

Network File System (NFS)

Overview of NFS

  • Supports Client-Side Caching
    • Modifications are flushed to the server when the client closes the file.
    • Consistency is implementation dependent.

Authority Delegation

  • Supports Authority Delegation
    • A server can delegate authority to a client and recall it through a callback mechanism.

Compound Procedure

  • Supports Compound Procedures
    • Multiple Round Trips to Single Round Trip

Partial Exports

  • Supports Partial Exports

Google File System (GFS)

Google File System

  • GFS: A distributed file system that stripes files across inexpensive commodity servers without RAID.
    • Layered Above Linux File System
    • Fault Tolerance Through Software
  • GFS Master: Stores Metadata About Files/Chunks
    • Metadata Cached in Main Memory
    • Operation Log Persisted in Local Storage
    • Periodically Polls Chunk Servers for Consistency

Reading a File

  1. A client sends the file name and chunk index to the master.
  2. The master responds with a contact address.
  3. The client then pulls data directly from a chunk server, bypassing the master.

Updating a File

  1. The client pushes its updates to the nearest chunk server holding the data.
  2. The nearest chunk server pushes the update to the next closest chunk server holding the data, and so on.
  3. When all replicas have received the data, the primary chunk server assigns a sequence number to the update operation and passes it on to the secondary chunk servers.
  4. The primary replica informs the client that the update is complete.

File Sharing Semantics


Apache Hadoop MapReduce

High-Level Architecture

Hadoop High-Level Architecture

  • Transform lists of input data elements into lists of output data elements by applying Mappers and Reducers
    • Immutable Data
    • No Communication

Mapper

  • The list of input data elements is iterated over, and each element is individually transformed into zero or more output data elements.

Reducer

  • The list of input data elements is iterated over, and the elements are aggregated into a single output data element.

Combiner

  • An optional component that consumes the outputs of a mapper to produce a summary as the inputs for a reducer.

Terms

  • InputSplit: A unit of work assigned to one map task.
    • Usually corresponds to a chunk of an input file.
    • Each record in a file belongs to exactly one input split and the framework takes care of dealing with record boundaries.
  • InputFormat: Determines how the input files are parsed, and defines the input splits.
  • OutputFormat: Determines how the output files are formatted.
  • RecordReader: Reads data from an input split and creates key-value pairs for the mapper.
  • RecordWriter: Writes key-value pairs to output files.
  • Partitioner: Determines which partition a given key-value pair will go to.

Data Flow

MapReduce Data Flow 1

MapReduce Data Flow 2

  • Shuffle: The process of partitioning the map output by reducer, sorting it, and copying the partitions from mappers to reducers.

Fault Tolerance

  • Primarily: Restart Failed Tasks
    1. Individual TaskTrackers periodically emit a heartbeat to the JobTracker.
    2. If a TaskTracker fails to emit a heartbeat to the JobTracker, the JobTracker assumes that the TaskTracker crashed.
    3. If the failed node was mapping, then other TaskTrackers will be asked to re-execute all the map tasks previously run by the failed TaskTracker.
      • Must be Side-Effect Free
    4. If the failed node was reducing, then other TaskTrackers will be asked to re-execute all reduce tasks that were in progress on the failed TaskTracker.
      • Must be Side-Effect Free
  • Secondarily: Speculative Execution
    • If straggler nodes rate-limit the rest of the job, Hadoop schedules redundant copies of the remaining tasks on nodes that have no other work to perform.

MapReduce Design Patterns

Counts and Summations

  • For each element, a mapper can emit a tuple of the element and a count of one.
  • A reducer can aggregate the counts for each element and emit a tuple of the element and its total count.
  • A combiner can pre-aggregate the counts across all the elements processed by a single mapper.
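
A toy Python rendering of the count pattern, with the shuffle simulated by a dictionary that groups counts by key; the input lines are made up.

from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) tuple for every word in the input record."""
    for word in line.split():
        yield word, 1

def combiner(pairs):
    """Locally sum the counts produced by one mapper."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return local.items()

def reducer(word, counts):
    """Sum all counts received for a word."""
    return word, sum(counts)

lines = ["a b a", "b c"]
shuffle = defaultdict(list)                  # simulated shuffle: group values by key
for line in lines:
    for word, count in combiner(mapper(line)):
        shuffle[word].append(count)
print([reducer(w, c) for w, c in shuffle.items()])   # [('a', 2), ('b', 2), ('c', 1)]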

Selection

  • A mapper can emit a tuple for each element that satisfies a predicate.

Projection

  • A mapper can emit a tuple whose fields are a subset of each element.
  • A reducer can eliminate duplicates.

Inverted Index

  • A mapper can emit a tuple of a value and a key in that specific order.
  • A reducer can aggregate all the keys for a distinct value.

Cross-Correlation

  • Problem: Given a set of tuples of items, for each possible pair of items, calculate the number of tuples where these items co-occur.
Pairs Approach (Slow)
class Mapper
  method Map(void, items [i1, i2, ...])
    for all item i in [i1, i2, ...]
      for all item j in [i1, i2, ...] such that j > i
        Emit(pair [i, j], count 1)

class Reducer
  method Reduce(pair [i, j], counts [c1, c2, ...])
    s = sum([c1, c2, ...])
    Emit(pair [i, j], count s)
Stripes Approach (Fast)
class Mapper
  method Map(void, items [i1, i2, ...])
    for all item i in [i1, i2, ...]
      H = new AssociativeArray : item -> counter
      for all item j in [i1, i2, ...] such that j > i
        H{j} = H{j} + 1
      Emit(item i, stripe H)

class Reducer
  method Reduce(item i, stripes [H1, H2, ...])
    H = new AssociativeArray : item -> counter
    H = merge-sum([H1, H2, ...])
    for all item j in H.keys()
      Emit(pair [i, j], H{j})

Apache Spark

RDD

  • RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.

Lineage


  • A lineage is a directed acyclic graph that expresses the dependencies between RDDs such that an RDD can be rebuilt in the event of a failure.

Transformation and Actions

  • Transformations: Operations that convert one RDD or a pair of RDDs into another RDD.
  • Actions: Operations that convert an RDD into an output.

Narrow vs. Wide Dependencies

Dependencies

Execution

  • When an action is invoked on an RDD, the Spark scheduler examines the RDD's lineage graph and builds a directed acyclic graph of transformations.
  • The transformations in the DAG are grouped into stages.
  • A stage is a collection of transformations with narrow dependencies, meaning that one partition of the output depends on only one partition of each input.
  • The boundaries between stages correspond to wide dependencies, meaning that one partition of the output depends on multiple partitions of some input, requiring a shuffle.
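
A small PySpark sketch of these ideas, assuming a local pyspark installation (the input path and application name are placeholders). flatMap and map are narrow dependencies executed within one stage; reduceByKey introduces a wide dependency and therefore a shuffle; take and count are actions that trigger execution.

from pyspark import SparkContext

sc = SparkContext("local[*]", "word-count-sketch")

lines = sc.textFile("input.txt")                  # RDD of lines (lazy)
counts = (lines.flatMap(lambda l: l.split())      # narrow dependency
               .map(lambda w: (w, 1))             # narrow dependency
               .reduceByKey(lambda a, b: a + b)   # wide dependency: shuffle, stage boundary
               .cache())                          # persist the intermediate RDD in memory

print(counts.take(5))    # action: the scheduler builds the DAG and runs the stages
print(counts.count())    # action: reuses the cached RDD instead of recomputing it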

Distributed Graph Processing

Pregel

  • In Pregel, graph processing problems are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology.
  • This vertex-centric approach is flexible enough to express a broad set of algorithms.

Architecture

  • Unit of Work: A partition consisting of a set of vertices and their outgoing edges.
  • Coordination: A single master with many workers.
  1. The master determines the number of partitions and distributes the partitions to each worker.
  2. The master assigns a portion of the user's input to each worker.
    1. If the worker is assigned a vertex that belongs to its partition of the graph, then the worker updates the state of the vertex.
      • Vertex's Current Value
      • Vertex's Outgoing Edges
      • Vertex's Activity Flag
    2. If the worker is assigned a vertex that does not belong to its partition of the graph, then the worker sends a message containing the vertex and its edges to the appropriate remote peer.
    3. All the input vertices are marked as active.
  3. The master instructs each worker to perform a superstep.
    1. The worker loops over its active vertices and computes as follows.
      1. Asynchronously execute a user-defined function on each vertex.
      2. Receive messages sent in the previous superstep.
      3. Send messages to be received in the next superstep.
      4. Vertices can modify their value.
      5. Vertices can modify the values of their edges.
      6. Vertices can add or remove edges.
      7. Vertices can deactivate themselves.
    2. The worker notifies the master how many vertices will be active in the next superstep.
  4. Repeat Step 3 until all the vertices are inactive.
  • Important Note: An active vertex is reactivated when it receives a message.
  • Bulk Synchronous Parallel: Workers compute asynchronously within each superstep, and communicate only between supersteps.
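
A single-process Python sketch of the vertex-centric model for a toy maximum-value propagation: in each superstep, every active vertex consumes the messages from the previous superstep, updates its value, sends messages along its outgoing edges, and deactivates itself until a new message arrives. Partitioning, workers, and fault tolerance are omitted.

def pregel_max(values, edges):
    """values: vertex -> initial value; edges: vertex -> list of out-neighbours."""
    inbox = {v: [] for v in values}      # messages delivered in this superstep
    active = set(values)
    while active:
        outbox = {v: [] for v in values}
        for v in list(active):
            new_value = max([values[v]] + inbox[v])
            if new_value != values[v] or not inbox[v]:
                values[v] = new_value
                for w in edges.get(v, []):          # send along outgoing edges
                    outbox[w].append(new_value)
            active.discard(v)                       # deactivate; a message reactivates
        inbox = outbox
        active |= {v for v, msgs in inbox.items() if msgs}
    return values

print(pregel_max({"a": 3, "b": 6, "c": 1}, {"a": ["b"], "b": ["c"], "c": ["a"]}))
# {'a': 6, 'b': 6, 'c': 6}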

Combiners and Aggregators

  • Combiners: An optional component which reduces the amount of data exchanged over the network and the number of messages.
    • i.e., A commutative and associative user-defined function.
  • Aggregators: An optional component which computes aggregate statistics from vertex-reported values.
    1. Workers aggregate values from their vertices during each superstep.
    2. At the end of each superstep, the values from the workers are aggregated in a tree structure, and the value from the root of the tree is sent to the master.
    3. The master shares the value with all vertices in the next superstep.
  • An aggregator is useful for detecting convergence conditions for vertices to transition to the inactive state.

Fault Tolerance

  • Key Idea: Checkpointing
  • At the beginning of a superstep, the master instructs the workers to save the state of their partitions to persistent storage.
    • Vertex Values.
    • Edge Values.
    • Incoming Messages.
  • The master separately saves the aggregator values.
  • Worker failures are detected using regular pings from the master to the workers.
  • When one or more workers fail, the master reassigns graph partitions to the currently available set of workers, and they all reload their partition state from the most recent available checkpoint.

Consistency and Replication

Motivations for Replication

  • Increased Reliability
  • Increased Throughput
  • Decreased Latency

Replicated Data Store

  • In a replicated data store, each data object is replicated at multiple hosts.
    • Local Replica: Same Hosts
    • Remote Replica: Different Hosts

Consistency Models

  1. Sequential Consistency
  2. Causal Consistency
  3. Linearizability
  4. Eventual Consistency
  5. Session Guarantees

Sequential Consistency

  • A data store is sequentially consistent when the result of any execution is the same as if the operations by all processes on the data were executed in some sequential order and the operations of each individual process appear in this sequence in the order specified by its program.

Positive Sequential Consistency Example

Negative Sequential Consistency Example

  1. R(x)a and R(x)b conflict.

Causal Consistency

  • A data store is causally consistent when writes related by the "causally precedes" relation are seen by all processes in the same order.
  • Concurrent writes may be seen in a different order on different machines.
  • "Causally precedes" is the transitive closure of two rules.
    1. Operation A causally precedes operation B if A occurs before B in the same process.
    2. Operation A causally precedes operation B if B reads a value written by A.
  • If operations A and B are concurrent (no "causally precedes"), then A and B can be read in either order.

Positive Causal Consistency Example

  1. W(x)a causally precedes R(x)a.
  2. R(x)a causally precedes W(x)b.
  3. W(x)b and W(x)c are concurrent.
  4. Therefore, the reads must occur in the following sequences:
    1. A, B, C, or
    2. A, C, B.

Negative Causal Consistency Example

  1. W(x)a causally precedes R(x)a.
  2. R(x)a causally precedes W(x)b.
  3. However, P3's R(x)b before R(x)a violates causal consistency.

Linearizability

  • A data store is linearizable when the result of any execution is the same as if the operations by all processes on the data store were executed in some sequential order that extends the "happens before" relation.
  • If operation A finishes before operation B begins, then A must precede B in the sequential order.

Eventual Consistency

  • If no updates take place for a long time, all replicas will gradually become consistent.
  • Allows different processes to observe write operations taking effect in different orders, even when these write operations are related by "causally precedes" or "happens before".

Session Guarantees

  • Session Guarantees: Restrict the behavior of operations applied by a single process in a single session.
  • Monotonic Reads: If a process reads $x$, any successive reads on $x$ by that process will always return the same value or a more recent value.
  • Monotonic Writes: A write by a process on $x$ is completed before any successive write on $x$ by the same process.
  • Read Your Own Writes: The effect of a write operation by a process on $x$ will always be seen by a successive read on $x$ by the same process.
  • Writes Follow Reads: A write by a process on $x$ following a previous read on $x$ by the same process is guaranteed to take place on the same or a more recent value of $x$ that was read.

Primary-Based Replication Protocols

  • In a primary-based protocol, each $x$ in the data store has an associated primary, which is responsible for coordinating writes on $x$.
  • If the primary replica fails, then one of the backup replicas may take over as the new primary.
    • Disadvantage: If the network is partitioned, the cluster can become split-brain, such that one replica in each partition believes it is the primary replica; hence, a divergence of state.

Advantages and Disadvantages

Advantages
  • Strong Consistency
Disadvantages
  • Performance Bottlenecks
  • Loss of Availability

Remote-Write Protocol

Remote-Write Protocol Example

  • Remote-Write: The primary replica is generally stationary and therefore must be updated remotely by other servers.

Local-Write Protocol

Local-Write Protocol Example

  • Local-Write: The primary replica migrates from server to server, allowing local updates.

Quorum-Based Replication Protocols

  • In a quorum-based protocol, all replicas are allowed to receive updates and reads, but operations are required to be accepted by a sufficiently large subset of replicas called a write quorum or a read quorum.

Requirements of Write and Read Quorums

Write and Read Quorums Examples

  • Let $N$ be the total number of replicas.
  • Let $N_{W}$ be the size of the write quorum.
  • Let $N_{R}$ be the size of the read quorum.
  • The following two rules must be satisfied:
    1. $N_{R} + N_{W} > N$; Read and write quorums must overlap.
    2. $N_{W} + N_{W} > N$; Two write quorums must overlap.
  • The first rule enables detection of read-write conflicts.
    • Read-write conflicts occur when one process wants to update data item while another is concurrently attempting to read that item.
  • The second rule enables detection of write-write conflicts.
    • Write-write conflicts occur when two processes want to perform an update on the same data.
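
A few lines of Python to check the two rules for candidate configurations; the parameters mirror the definitions above, and the example values are illustrative only.

def quorum_ok(n, n_w, n_r):
    """True iff the configuration detects both read-write and write-write conflicts."""
    return (n_r + n_w > n) and (n_w + n_w > n)

print(quorum_ok(n=12, n_w=7, n_r=6))    # True: both rules hold
print(quorum_ok(n=12, n_w=6, n_r=7))    # False: two write quorums need not overlap
print(quorum_ok(n=12, n_w=12, n_r=1))   # True: read-one, write-all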

Partial Quorums

  • Derivatives of Amazon's Dynamo allow various degrees of consistency by tuning $N_{R}$ and $N_{W}$.
    • Strong Consistency: $N_{R} + N_{W} > N$
    • Weak Consistency: $N_{R} + N_{W} \le N$
  • Important Note 1: The strong consistency mode does not avoid write-write conflicts.
  • Important Note 2: The weak consistency mode does not avoid read-write conflicts or write-write conflicts.
  • To resolve write-write conflicts, updates are tagged with timestamps, and resolution policies such as last-write wins or vector clocks are applied.

Eventually-Consistent Replication

  • A server that receives an update replies with an acknowledgement to the client first, and then propagates the update lazily to the remaining replicas.
  • If a replica is unreachable, then it can be updated later using an anti-entropy mechanism.
    • e.g., Replicas may periodically exchange hashes of data to detect discrepancies using Merkle trees.
    • e.g., Updates can be timestamped to enable determination of the latest version of a data item.

Fault Tolerance

Dependability Requirements

  1. Availability: The system should operate correctly at any given instant in time.
  2. Reliability: The system should run continuously without interruption.
  3. Safety: Failure of the system should not have catastrophic consequences.
  4. Maintainability: A failed system should be easy to repair.

Definitions

  • Failure: When a system cannot fulfill its promises.
  • Error: Part of a system's state that may lead to a failure.
  • Fault: The cause of an error.

A fault may lead to an error, which may lead to a failure.

Types of Faults

  1. Transient: Occurs once and disappears.
  2. Intermittent: Occurs and vanishes, reappears.
  3. Permanent: Continues to exist until faulty component is replaced.

Types of Failure

  1. Crash Failure: A server halts, but is working correctly until it halts.
  2. Omission Failure: A server fails to respond to incoming requests.
    1. Receive Omission: A server fails to receive incoming messages.
    2. Send Omission: A server fails to send outgoing messages.
  3. Timing Failure: A server's response lies outside the specified time interval.
  4. Response Failure: A server's response is incorrect.
    1. Value Failure: The value of the response is wrong.
    2. State Transition Failure: The server deviates from the correct flow of control.
  5. Arbitrary Failure: A server may produce arbitrary responses at arbitrary times.

Failure Masking by Redundancy

  • Information Redundancy: Extra bits are added to allow recovery from garbled bits.
    • e.g., Hamming Code
  • Time Redundancy: An action is performed, and then, if need be, it is performed again.
    • e.g., Transactions, Idempotent Requests
  • Physical Redundancy: Extra equipment or processes are added to make it possible for the system as a whole to tolerate the loss or malfunctioning of some parts.

Resilience by Process Groups

Flat Groups and Hierarchical Groups

  • Protection against process failures can be achieved by replicating processes into groups.
  • When a message is sent to the group itself, all members of the group receive it.
  • The purpose of introducing groups is to allow a process to deal with collections of other processes as a single abstraction.
  • A flat group is symmetrical and has no single point of failure.
    • Advantage: If one of the processes crashes, the group simply becomes smaller, but can otherwise continue.
    • Disadvantage: Decision making is more complicated.
  • A hierarchical group is asymmetrical with a single point of failure.
    • Advantage: Decision making is simpler.
    • Disadvantage: If the coordinator crashes, the entire group halts.

Consensus Problem

  • Each process has a procedure propose(val) and a procedure decide().
  • Each process first proposes a value by calling propose(val) once, with some argument val determined by the initial state of the process.
  • Each process then learns the value agreed upon by calling decide().

Properties

  • Safety Property 1 (Agreement): Two calls to decide() never return different values.
  • Safety Property 2 (Validity): If a call to decide() returns v, then some process invoked a call to propose(v).
  • Liveness Property: If a process calls propose(v) or decide() and does not fail, then this call eventually terminates.

Variations

Circumstances Under Which Distributed Consensus Can Be Reached

  1. Synchronous vs. Asynchronous Processes: Is there a bound on the amount of time it takes for a process to take its next step? Is the bound known by all processes?
  2. Communication Delays: Is there a bound on the length of time it takes for a sent message to be delivered? Is the bound known by all processes?
  3. Message Delivery Order: How does the order in which messages are sent affect the order in which they are delivered to the recipients?
  4. Unicast vs. Multicast Messaging.

RPC Failure Semantics

  1. The client is unable to locate the server.
  2. The request message from the client to the server is lost.
  3. The server crashes after receiving a request.
  4. The reply message from the server to the client is lost.
  5. The client crashes after sending a request.

Dealing with RPC Server Crashes

  1. Reissue the request, leading to at-least-once semantics. As a side-effect, the request may be processed multiple times by the service handler, which is safe as long as the request is idempotent.
  2. Give up and report a failure, leading to at-most-once semantics. There is no guarantee that the request has been processed.
  3. Determine whether the request was processed and reissue if needed, leading to exactly once semantics. This scheme is difficult to implement as the server may have no way of knowing whether it performed a particular action.
  4. Make no guarantees at all, leading to confusion.

Actions and Acknowledgments


  • Let $M$ be the server replying to the client with an acknowledgment message.
  • Let $P$ be the server executing a request from the client.
  • Let $C$ be the server crashing.
  1. $M \to P \to C$ (Very Bad): A crash occurs after sending the completion message and executing the request.
  2. $M \to C \to P$ (Very Bad): A crash happens after sending the completion message, but before executing the request.
  3. $C \to M \to P$ (Bad): A crash happens before the server could do anything.
  4. $P \to M \to C$ (Good): A crash occurs after sending the completion message and executing the request.
  5. $P \to C \to M$ (Bad): The request is executed, after which a crash occurs before the completion message could be sent.
  6. $C \to P \to M$ (Bad): A crash happens before the server could do anything.

Apache ZooKeeper

Overview

  • ZooKeeper is a centralized coordination service for distributed systems, organized as a hierarchical key-value store.
  • ZooKeeper emphasizes good performance (particularly for read-dominant workloads), being general enough to be used for many different purposes, reliability, and ease of use.

Common Use-Cases

  1. Group Membership
  2. Leader Election
  3. Dynamic Configuration
  4. Status Monitoring
  5. Queuing
  6. Barriers
  7. Critical Sections

Data Model

  • ZooKeeper's Data Model: A hierarchical key-value store similar to a file system.
  • ZNode: A node that may contain data and children.
  • Reads and writes to a single node are atomic: a value is read or written fully, or not at all.

Node Flags

  1. Ephemeral Flag: The node exists only as long as the session that created it is active, unless it is explicitly deleted earlier.
  2. Sequence Flag: A monotonically increasing counter is appended to the end of the node's path.
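
A sketch of how the two flags are commonly combined for leader election, assuming the third-party kazoo Python client and a ZooKeeper ensemble at a placeholder address.

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Ephemeral + sequence: the node disappears when this session ends, and
# ZooKeeper appends a monotonically increasing counter to the node's path.
me = zk.create("/election/candidate-", b"", ephemeral=True, sequence=True, makepath=True)

# The candidate whose node carries the smallest sequence number acts as the leader.
children = sorted(zk.get_children("/election"))
print("leader" if me.endswith(children[0]) else "follower")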

Consistency Model

  • ZooKeeper ensures that writes are linearizable and that reads are serializable.
  • ZooKeeper guarantees per-client FIFO servicing of requests.

Servers

  • When running in replicated mode, all servers have a copy of the state in memory.
  • A leader is elected at startup, and all updates go through this leader.
  • Update responses are sent once a majority of servers have persisted the change.
  • To tolerate $n$ failures, $2n + 1$ replicated servers are required.

Distributed Commit and Checkpoints

ACID

  • Atomicity: An operation occurs fully or not at all. (Difficult)
  • Consistency: A transaction is a valid transformation of the state.
  • Isolation: A transaction is not aware of other concurrent transactions.
  • Durability: Once a transaction completes, its updates persist, even in the event of failure.

Two-Phase Commit (2PC)

  • Two-Phase Commit: A coordinator-based distributed transaction commitment protocol.
  • Phase 1: Coordinator asks participants whether they are ready to commit.
  • Phase 2: Coordinator examines votes and decides the outcome of the transaction.
    • If all participants vote to commit, then the transaction is committed successfully.
    • Otherwise, the transaction is aborted.
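
A minimal in-memory Python sketch of the two phases; timeouts, recovery logs, and crash handling are omitted, and the participant names and vote functions are made up.

def two_phase_commit(participants):
    """participants: mapping name -> callable returning True (commit) or False (abort)."""
    # Phase 1: the coordinator asks every participant whether it is ready to commit.
    votes = {name: vote() for name, vote in participants.items()}

    # Phase 2: commit only if every participant voted to commit; otherwise abort.
    decision = "GLOBAL_COMMIT" if all(votes.values()) else "GLOBAL_ABORT"
    for name in participants:
        print(f"coordinator -> {name}: {decision}")
    return decision

print(two_phase_commit({"db-1": lambda: True, "db-2": lambda: True}))    # GLOBAL_COMMIT
print(two_phase_commit({"db-1": lambda: True, "db-2": lambda: False}))   # GLOBAL_ABORT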

Key Assumptions

  • Synchronous Processes
  • Bounded Communication Delays
  • Crash-Recovery Failures
  • Stable Storage with Recovery Logs

States and Transitions

2PC States and Transitions

  • (a): Coordinator
  • (b): Participants

Participant-Participant Communication


  • If a participant $P$ does not receive a decision from the coordinator within a bounded period of time, it may try to learn the decision from another participant $Q$.

Coordinator Crashes

  • A participant is able to make progress as long as it received the decision from the coordinator despite the crash, or if it was able to learn the decision from another participant.
  • In general, the transaction is safe to commit if all participants voted to commit (READY or COMMIT), and safe to abort otherwise.

Simultaneous Coordinator and Participant Crashes

  • A simultaneous failure of the coordinator and a participant makes it difficult to determine whether all the participants are READY.

Distributed Checkpoints

  • Recovery after failure is only possible if the collection of checkpoints by individual processes forms a distributed snapshot.
  • A distributed snapshot requires that process checkpoints contain a corresponding send event for each message received.
  • Recovery Line: The most recent distributed snapshot.
  • Domino Effect: If the most recent checkpoints taken by processes do not provide a recovery line, then successively earlier checkpoints must be considered.
    • i.e., Cascading Rollback.

Coordinated Checkpointing Algorithm

  • The coordinated checkpointing algorithm can be applied to create recovery lines.
Phase 1
  1. The coordinator sends a CHECKPOINT_REQUEST message to all processes.
  2. Upon receiving a CHECKPOINT_REQUEST message, each process does the following:
    1. Pauses sending new messages to other processes.
    2. Takes a local checkpoint.
    3. Returns an acknowledgment to the coordinator.
Phase 2
  1. Upon receiving acknowledgments from all processes, the coordinator sends a CHECKPOINT_DONE message to all processes.
  2. Upon receiving a CHECKPOINT_DONE message, each process resumes processing messages.

Raft Consensus Algorithm

Replicated State Machines

  • In a replicated state machine architecture, a consensus algorithm manages a replicated log containing state machine commands from clients.
  • The servers' state machines process identical sequences of commands from the logs, so they produce the same outputs.
  • As a result, the servers appear to form a single, highly reliable state machine.
  • Furthermore, the distributed system is available as long as any majority of the servers are operational and can communicate with each other and with clients.

Problems with Paxos

  1. Unintuitive.
  2. Incomplete.
    • Multi-Paxos?
    • Liveness?
    • Cluster Membership Management?
  3. Inefficient.
    • 2 Message Rounds vs. Single Leader.
  4. No Agreement on Implementation.

Cheat Sheet

Raft Consensus Algorithm

Key Properties

  • Election Safety: At most one leader can be elected in a given term.
  • Leader Append-Only: A leader never overwrites or deletes entries in its log; it only appends new entries.
  • Log Matching: If two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.
  • Leader Completeness: If a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms.
  • State Machine Safety: If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.

Basics

Raft Basics

Server States

  • A Raft cluster contains several servers in which each server is in one of three states: leader, follower, and candidate.
  • Leader: A single active server that handles all client requests.
  • Follower: Many passive servers that respond to requests from leaders and candidates.
  • Candidate: A server used to elect a new leader.

Terms

  • Terms act as a logical clock that allow servers to detect obsolete information.
  • Each server stores a current term number, which increases monotonically over time.
  • Current terms are exchanged whenever servers communicate.
  • If one server’s current term is smaller than the other’s, then it updates its current term to the larger value.
  • If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state.
  • If a server receives a request with a stale term number, it rejects the request.

Leader Election

  • Goal: A new leader must be chosen when an existing leader fails.
  1. A server starts as a follower and remains a follower as long as it receives periodic, valid RPCs from a leader or a candidate.
  2. If the follower receives no communication before its election timeout expires, then it begins a leader election.
  3. The follower increments its current term and becomes a candidate.
  4. The candidate votes for itself and broadcasts RequestVote RPCs in parallel to each of the other servers in the cluster.
    1. If the candidate receives a majority of the votes, it becomes the new leader.
    2. If the candidate receives an AppendEntries RPC from a leader whose term is at least as large as the candidate's current term, the candidate becomes a follower.
    3. If the candidate neither wins nor loses the election (e.g., because of a split vote), it times out and starts a new leader election.
  • Safety: Each server will vote for at most one candidate in a given term, on a first-come-first-served basis.
  • Liveness: Randomized election timeouts are used to ensure that split votes are rare.

Log Replication

  • Goal: The leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own.
  1. The leader receives a command from a client.
  2. The leader appends the command to its log as a new entry.
  3. The leader broadcasts AppendEntries RPCs in parallel to each of the other servers in the cluster.
  4. When the entry has been safely replicated, the leader applies the entry to its state machine.
  5. The leader returns the result of the command's execution to the client.
  • Important Note: For crashed or slow followers, the leader retries AppendEntries RPCs until they succeed.

Log Structure

Raft Log Structure

  • Committed: An entry is committed once it has been replicated on a majority of servers; committed entries are durable and will eventually be executed by all of the available state machines.

Log Matching Property

  1. If two entries in different logs have the same index and term, then they store the same command.
  2. If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

AppendEntries Consistency Check (Induction Step)

  • When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries.
  • If the follower does not find an entry in its log with the same index and term, then it refuses the new entries.
  • Inconsistencies are handled by forcing the followers' logs to duplicate the leader's.

Safety

  • Goal: If any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index.

Leader Completeness Property

  • Leader Completeness Property: Once a log entry is committed, all future leaders must store that entry.
  • Therefore, servers with incomplete logs must not get elected.
    • Candidates include the index and the term of the last log entry in their RequestVote RPCs.
    • A follower denies a RequestVote RPC if its own log is more up-to-date than the candidate's (see the sketch below).
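
A Python sketch of the vote-granting rule, combining the one-vote-per-term restriction from Leader Election with this log up-to-date check. Field names follow the Raft paper; persistence and the RPC transport are omitted.

def grant_vote(state, term, candidate_id, last_log_index, last_log_term):
    """state holds current_term, voted_for, and log as a list of (index, term) pairs."""
    if term < state["current_term"]:
        return False                          # stale term: reject the request
    if term > state["current_term"]:
        state["current_term"] = term          # newer term: adopt it and clear the vote
        state["voted_for"] = None
    my_last_index, my_last_term = state["log"][-1] if state["log"] else (0, 0)

    # The candidate's log must be at least as up-to-date as this server's log.
    log_ok = (last_log_term > my_last_term or
              (last_log_term == my_last_term and last_log_index >= my_last_index))
    # At most one vote per term, granted on a first-come-first-served basis.
    if log_ok and state["voted_for"] in (None, candidate_id):
        state["voted_for"] = candidate_id
        return True
    return False

follower = {"current_term": 3, "voted_for": None, "log": [(1, 1), (2, 3)]}
print(grant_vote(follower, term=4, candidate_id="s2", last_log_index=2, last_log_term=3))  # True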

Apache Kafka

Overview

  • Apache Kafka is an open-source distributed stream-processing platform.

Key Features

  • Publish-Subscribe Messages.
  • Real-Time Stream Processing.
  • Distributed and Replicated Message/Stream Storage.

Common Use-Cases

  • High-Throughput Messaging.
  • Activity Tracking.
  • Metric Collection.
  • Log Aggregation.

Topics, Partitions, and Retention Policy

  • Topic: A stream of key-value records that are stored as a partitioned log with a retention policy.
  • Partitions allow a topic to be parallelized by splitting its data among multiple Kafka brokers, which can have multiple replicas.
  • Kafka only provides a total order over records within a partition.
  • Kafka either retains records for a finite time period or performs log compaction to retain the latest record per key.

Producers

  • Producers push records to Kafka brokers for a specific partition.
  • Producers support asynchronous batching for improved throughput with a small latency penalty.
  • Producers support idempotent delivery to avoid duplicate commits.

Consumers

  • Consumers pull records in batches from Kafka brokers and advance their offsets within each partition of the topic.
  • Consumers support exactly once semantics when a client consumes from one topic and produces to another.
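
A sketch using the third-party kafka-python client; the broker address, topic name, and group id are placeholders. The producer pushes key-value records, while the consumer pulls them in batches and tracks its offsets within its consumer group.

from kafka import KafkaProducer, KafkaConsumer

# Producer: push key-value records to a topic; the key determines the partition.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", key=b"user-42", value=b"/index.html")
producer.flush()    # block until the batched records have been acknowledged

# Consumer: pull records; offsets are tracked per consumer group.
consumer = KafkaConsumer("page-views",
                         bootstrap_servers="localhost:9092",
                         group_id="analytics",
                         auto_offset_reset="earliest")
for record in consumer:
    print(record.partition, record.offset, record.key, record.value)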

Record Streams vs. Changelog Streams

`KStream`-`KTable` Conversion

  • Record Streams: Where each record represents a single event.
  • Changelog Stream: Where each record represents an update to a state.
  • Kafka provides the KStream API for record streams.
  • Kafka provides the KTable API for changelog streams.

Streams-Tables Duality

  • Each record in a changelog stream defines the latest row operation in a table, per key.
  • Each row in a table defines the latest record in a changelog stream, per key.

Windowed Streams

  • Windowing allows for grouping records close in time.
  • Hopping Time Windows: Which are defined by a size and a hop.
  • Tumbling Time Windows: Which are a special non-overlapping case of hopping time windows; the hop equals the size.
  • Sliding Windows: Which slide continuously over the time axis, used only for joins.
  • Session Windows: Which aggregate data by period of activity; new windows are created once a period of inactivity exceeds a certain length.

Clocks

Motivations

  • Clocks are instruments used to measure time, so a lack of synchronization among clocks in a distributed system can result in unclear ordering of events.

Calendars

Roman Calendar

  • The Roman calendar is a lunar calendar with months of 29 or 30 days.
  • The original Roman calendar had only 10 months, with winter days unallocated.
  • Reforms such as adding two extra months were later introduced, but the calendar remained difficult to align with seasons, since it is a lunar calendar.

Julian Calendar

  • The Julian calendar was the first solar calendar.
  • Introduced in 45 BC, the calendar improved on the Roman calendar, such as by adding the concept of a leap year.

Gregorian Calendar

  • The Gregorian calendar is the most widely used calendar in the world.
  • Introduced in 1582, its main improvement over the Julian calendar is measuring leap years far more accurately.

Timekeeping Standards

  • Solar Day: The interval between two consecutive transits of the sun; this is not constant.
  • TAI (Temps Atomique International): International time scale based on the average of multiple Cesium 133 atomic clocks.
  • UTC (Coordinated Universal Time): The world's primary time standard. Based on TAI, UTC uses leap seconds at irregular intervals to compensate for the Earth's slowing rotation.

Limitations of Atomic Clocks

  • Although atomic clocks are the most accurate timekeeping devices known, they are limited by relativistic time dilation.

Definitions

  • Let $C(t)$ denote the value of a clock $C$ at a reference time $t$.
  • Clock Skew of $C$ relative to $t$ is $\frac{dC}{dt} - 1$.
  • Offset of $C$ relative to $t$ is $C(t) - t$.
  • Maximum Drift Rate of $C$ is a constant $\rho$ such that $1 - \rho \le \frac{dC}{dt} \le 1 + \rho$.

Network Time Protocol (NTP)

NTP Timing Diagram

  • NTP: A common time synchronization protocol for computers over variable-latency data networks.
  1. The offset of $B$ relative to $A$ is estimated as: $$\theta = \frac{(T_{2} - T_{1}) + (T_{3} - T_{4})}{2}$$
  2. The one-way network delay between $A$ and $B$ is estimated as: $$\delta = \frac{(T_{4} - T_{1}) - (T_{3} - T_{2})}{2}$$
  3. NTP collects multiple $(\theta, \delta)$ pairs and uses the minimum value of $\delta$ as the best estimate of the delay.
  4. The corresponding $\theta$ is taken as the most reliable estimate of the offset.
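
A quick numeric check of the two formulas in Python; the timestamps (in milliseconds) are made up: client $A$ sends at $T_1$, server $B$ receives at $T_2$ and replies at $T_3$, and the client receives the reply at $T_4$.

# Timestamps in milliseconds (made-up values).
t1, t2, t3, t4 = 0, 62, 68, 20

theta = ((t2 - t1) + (t3 - t4)) / 2   # offset of B relative to A
delta = ((t4 - t1) - (t3 - t2)) / 2   # one-way network delay estimate

print(theta, delta)   # 55.0 7.0 -> B's clock is about 55 ms ahead; ~7 ms delay each way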

Clock Strata

  • A reference clock, such as an atomic clock, is said to operate at stratum 0.
  • A computer at stratum 1 has its time controlled by a stratum 0 device.
  • A computer at stratum 2 has its time controlled by a stratum 1 device.
  • Computers may only adjust the time of computers at greater strata, and when adjusting, must make sure time does not appear to flow backwards.

Clock Accuracy

  • NTP accuracy is typically measured in 10s of milliseconds.
  • PTP accuracy is typically measured in 100s of nanoseconds by leveraging hardware timestamping.

Logical Clocks

  • In the absence of tightly synchronized clocks, processes can still agree on a meaningful partial order of events following the "happens-before" relation.

Lamport Clocks

  1. Before executing an event, process $P_{i}$ increments its own counter $C_{i} = C_{i} + 1$.
  2. When $P_{i}$ sends a message $m$ to $P_{j}$, it tags $m$ with a timestamp $ts(m)$ equal to $C_{i}$.
  3. When $P_{j}$ receives a message $m$ from $P_{i}$, it adjusts its counter to $C_{j} = \max(C_{j}, ts(m))$.
  4. $P_{j}$ increments $C_{j}$ before delivering the message to the application.
  • Lamport clocks ensure the "happens-before" relation: $a \to b \implies C(a) < C(b)$.
  • However, Lamport clocks do not ensure the causality relation: $C(a) < C(b) \implies a \to b$.
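
A Lamport clock in a few lines of Python, following the four rules above; process identities and the surrounding event loop are omitted.

class LamportClock:
    def __init__(self):
        self.c = 0

    def event(self):
        self.c += 1                  # rule 1: increment before executing an event
        return self.c

    def send(self):
        return self.event()          # rule 2: the timestamp tags the outgoing message

    def receive(self, ts):
        self.c = max(self.c, ts)     # rule 3: adopt the larger counter
        return self.event()          # rule 4: increment before delivering the message

p1, p2 = LamportClock(), LamportClock()
m = p1.send()          # P1 sends a message with timestamp 1
print(p2.receive(m))   # P2 delivers it at timestamp 2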

Vector Clocks

  1. Before executing an event, process $P_{i}$ increments its own counter $VC_{i}[i] = VC_{i}[i] + 1$.
  2. When $P_{i}$ sends a message $m$ to $P_{j}$, it tags $m$ with a vector timestamp $ts(m)$ equal to $VC_{i}$.
  3. When $P_{j}$ receives a message $m$ from $P_{i}$, it adjusts its counter to $VC_{j}[k] = \max(VC_{j}[k], ts(m)[k]), \forall k$.
  4. $P_{j}$ increments $VC_{j}[j]$ before delivering the message to the application.
  • Vector clocks ensure both the "happens-before" relation and the causality relation: $a \to b \Leftrightarrow VC(a) < VC(b)$.
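
A matching vector clock sketch in Python for n processes; leq implements the component-wise comparison used to test whether one event causally precedes another.

class VectorClock:
    def __init__(self, i, n):
        self.i, self.v = i, [0] * n

    def event(self):
        self.v[self.i] += 1          # rule 1: increment this process's own entry
        return list(self.v)

    def send(self):
        return self.event()          # rule 2: the vector timestamp tags the message

    def receive(self, ts):
        self.v = [max(a, b) for a, b in zip(self.v, ts)]   # rule 3: component-wise max
        return self.event()          # rule 4: increment own entry before delivery

def leq(a, b):
    """a causally precedes (or equals) b iff every component of a is <= that of b."""
    return all(x <= y for x, y in zip(a, b))

p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
m = p0.send()                        # [1, 0]
d = p1.receive(m)                    # [1, 1]
print(leq(m, d), leq(d, m))          # True False: the send happens-before the delivery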

CAP Principle

Brewer's Theorem

  • It is impossible for a distributed system to provide all three of:
    1. Consistency: Nodes in a system agree on the most recent state of the data.
    2. Availability: Nodes are able to execute read-only queries and updates.
    3. Partition Tolerance: The system continues to function if its servers are separated into disjoint sets (e.g., because of a network failure).

CP vs. AP Systems

  • In the event of a partition (P), the system must choose either consistency (C) or availability (A), and cannot provide both simultaneously.
  • However, during failure-free operation, a system may be simultaneously highly available and strongly consistent.
  • CP System: In the event of a partition, choose Consistency (C) over Availability (A).
    • Supports serializability, linearizability, sequential consistency, and $N_{R} + N_{W} > N$.
  • AP System: In the event of a partition, choose Availability (A) over Consistency (C).
    • Supports eventual consistency, and causal consistency.
    • Appropriate for many applications that are latency-sensitive, inconsistency-tolerant, and transaction-free.

PACELC

  • If there is a Partition (P), a choice must be made between Availability (A) and Consistency (C).
  • Else, a choice must be made between lower Latency (L) and Consistency (C).

Tunable Consistency

  • Key-Value, Strong Consistency: If $N_{R} + N_{W} > N$, then every read is guaranteed to observe the effects of all writes that finished before the read started.
  • The partial quorums for reads and writes can be determined in some key-value storage systems on a per-request basis using client-side consistency settings, leading to tunable consistency.

Client-Side Consistency

  • Sloppy Quorums: A partial quorum in which a set of replicas can change dynamically, such as to adjust to network partitions.
  • Hinted Handoff: An arbitrary node accepts an update and hands it off to the intended node once the network partition heals.

Apache Cassandra

  • A quorum-replicated key-value store supporting tunable consistency with optional, full write availability.

Data Model

  • Keyspace: A namespace for column families.
  • Column Family: A table of columns consisting of a name, a value, and a timestamp.
  • Row Key: A mandatory key that uniquely identifies each row.
  • Sparse-Column Storage: For a given row, only the columns present are stored; i.e., NULL values are ignored.
  • Supports Hash Indices.
  • No Joins or Foreign Keys.

Consistency

  • ONE: $N_{R} / N_{W} = 1$.
  • ANY: For writes only, like ONE, but uses hinted handoff if needed.
  • TWO: $N_{R} / N_{W} = 2$.
  • THREE: $N_{R} / N_{W} = 3$.
  • QUORUM: $N_{R} / N_{W} = \text{Ceiling}[(N + 1) / 2]$.
  • ALL: $N_{R} / N_{W} = N$.
  • LOCAL_ONE / LOCAL_QUORUM: Like ONE / QUORUM, but the subset of replicas is chosen from the local data center only.
  • EACH_QUORUM: For writes only, writes to a quorum in each data center.

Puts

  • A PUT operation is executed on behalf of a client by a coordinator.
  • The coordinator broadcasts the update to all replicas of a row.
  • The consistency level determines only how many acknowledgments the coordinator waits for.

Gets

  • A GET operation is executed on behalf of a client by a coordinator.
  • The coordinator broadcasts the read to all replicas of a row using the following kinds of requests.
    1. Direct Read Request: Retrieves data from the closest replica.
    2. Digest Request: Retrieves a hash of the data from the remaining replicas; the coordinator waits for at least $N_{R} - 1$ of these to respond.
    3. Background Read Repair Request: Sent if a discrepancy is detected among the hashes reported by different replicas; tells the replica to obtain the latest value.