ECE 459 - Programming for Performance

Programming for Performance

Performance

  • Items Per Unit Time (Bandwidth - More Is Better): A measure of how much work can be done simultaneously.
  • Time Per Item (Latency - Lower is Better): A measure of how much time it takes to do any one particular task.

Improving Latency

  1. Profile the Code.
  2. Do Less Work: Reduce Logging & Reporting.
  3. Be Prepared.
  4. Be Smarter: Better Asymptotic Performance & Smaller Constant Factors.
  5. Improve Hardware: CPU, Memory, Network, Storage.
  6. Assembly: Provide Compiler Hints & Vectorization.

Improving Bandwidth

  1. Parallelism: Algorithmic.
  2. Pipelining: Intra-CPU.
  3. Hardware: Inter-CPU.

Difficulties with Parallelism

  1. Not all tasks are inherently parallelizable; thus, even an infinite number of CPUs will not always improve bandwidth.
  2. Tasks are composed of various parallel and sequential sub-tasks; identifying these sub-tasks can be difficult.
  3. Parallel tasks often lack a total ordering and have only a partial ordering; thus, they are harder to reason about and to test.
  4. Parallel tasks are often difficult to cache.
  5. Data Race: When two threads/processes both attempt to simultaneously access the same data, and at least one of the accesses is a write. This can lead to incorrect intermediate states becoming visible to one of the participants.
  6. Deadlock: When none of the threads/processes can make progress on a task because of a cycle in the resource requests.

Modern Processors

Modern Processors

  • von Neumann Machine Architecture: A program is comprised of both instructions and data, both of which are stored in the same memory, and the program executes sequentially, one statement at a time.
  • CISC Machines: Complex Instruction Set Computing.
    • Advantage: Complex Features.
    • Disadvantage: Difficult Pipelining.
    • More Cycles / Instruction.
  • RISC Machines: Reduced Instruction Set Computing.
    • Advantage: Easy Pipelining.
    • Disadvantage: Complex Compilers.
    • Fewer Cycles / Instruction.
  • Old Optimization Goal #1: Minimize Page Faults (matters less now that memory is cheap).
  • Old Optimization Goal #2: Minimize Instruction Count (matters less now that storage is cheap).

CPU Walls

  • Wall #1: CPU clock speeds stopped increasing around 2005, plateauing near 3 GHz.
  • Wall #2: Branch prediction accuracy has plateaued at roughly 95%.
  • Wall #3: Memory speeds have not kept pace with CPU speeds; thus, runtime is often dominated by cache misses.
  • Wall #4: CPU hardware is limited by the speed of light.

Pipelining

  1. Fetch Instruction.
  2. Decode Instruction.
  3. Fetch Operands.
  4. Perform Operation.
  5. Write Result.

Pipelining allows the stages of various instructions to be executed in parallel.

Hazards

  1. An instruction may need the result of a previous instruction.
  2. Multiple instructions can conflict over CPU resources.
  3. A fetch may be unable to identify the next instruction because of a branch.
  4. If a branch was mispredicted, the pipeline must be flushed.

Other Modern Processor Features

  • Miss Shadow: A CPU can have multiple LOAD instructions outstanding in parallel; it continues executing past a load, even though the destination register is not yet ready, and stalls only when that register is actually used.
  • Branch Prediction: A CPU can predict what branch will be executed such that the CPU can eagerly pipeline the predicted instructions. If the CPU is wrong, then it flushes the pipeline.
  • Dual Issue Instructions: If two consecutive instructions take the same amount of cycles, use unrelated registers, and do not consume two of the same resource, then a CPU can execute the two instructions in parallel.
  • Register Renaming: A CPU can map the registers named in instructions to different physical registers to prevent pipeline hazards.
    • This helps branch prediction recover from mispredictions faster.
    • This helps handle cache misses.
  • Out-of-Order Execution: A CPU can execute instructions non-sequentially to improve the performance of other processor features.

A Deeper Look at Cache Misses

  • All memory addresses are mapped to pages.
    • Cache Hit: If a page is found in the cache.
    • Cache Miss: If a page is not found in the cache.
  • A cache miss requires the page to be loaded from memory.
  • Hit Ratio: The percentage of the time that a page is found in the cache.

Cache Levels Hierarchy (Fastest to Slowest)

  1. L1 Cache.
  2. L2 Cache.
  3. L3 Cache.

Effective Access Time Formula (w/o Disk)

$$\text{Effective Access Time} = h \times t_{c} + (1 - h) \times t_{m}$$
  • Where $h$ is the hit ratio.
  • Where $t_{c}$ is the time required to load a page from cache.
  • Where $t_{m}$ is the time required to load a page from memory.
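
A quick worked example with assumed values $h = 0.9$, $t_{c} = 1\,\text{ns}$, and $t_{m} = 100\,\text{ns}$:
$$\text{Effective Access Time} = 0.9 \times 1\,\text{ns} + 0.1 \times 100\,\text{ns} = 10.9\,\text{ns}$$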

Effective Access Time Formula (w/ Disk)

$$\text{Effective Access Time} = h \times t_{c} + (1 - h)(p \times t_{m} + (1 - p) \times t_{d})$$
  • Where $h$ is the hit ratio.
  • Where $p$ is the probability that a page is in memory.
  • Where $t_{c}$ is the time required to load a page from cache.
  • Where $t_{m}$ is the time required to load a page from memory.
  • Where $t_{d}$ is the time required to load a page from disk.
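
Extending the same assumed values with $p = 0.99$ and $t_{d} = 10\,\text{ms} = 10^{7}\,\text{ns}$:
$$\text{Effective Access Time} = 0.9 \times 1 + 0.1 \times (0.99 \times 100 + 0.01 \times 10^{7}) \approx 10{,}011\,\text{ns}$$
Even a 1% chance of going to disk dominates the access time, which is why hit ratios matter so much.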

CPU Hardware, Branch Prediction

Multicore Processors

  • A symmetric multiprocessor contains multiple physical CPUs, and each physical CPU can have multiple cores (virtual CPUs).
    • Important Note 1: The various cores in a CPU share the same cache.
    • Important Note 2: The various CPUs have approximately the same access time for resources.
  • Non-Uniform Memory Access (NUMA): When various CPUs can access different resources at different speeds.
  • Affinity: To avoid the task switching that occurs when two threads share one CPU, pin (affine) each thread to a different CPU.

Branch Prediction and Misprediction

  • Branch Prediction: The compiler and the CPU analyze instructions to predict whether a branch will be taken.
  • Branch Hints: In gcc, the __builtin_expect() built-in lets the program give the compiler branch hints, as in the sketch below.
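
A minimal sketch of such a hint, using the common likely/unlikely macros built on gcc's __builtin_expect(); the function and macro names here are illustrative, not from the course code.

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int parse(const char *buf) {
    if (unlikely(buf == NULL)) {  /* hint: the error path is rarely taken */
        return -1;
    }
    /* ... hot path ... */
    return 0;
}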

Branch Prediction Models

  • Always-Wrong Prediction Model: $$\text{Average Cycles Per Instruction} = \text{non-branch}\% \times n_{avg} + \text{branch}\% \times n_{wrong}$$
  • Always-Right Prediction Model: $$\text{Average Cycles Per Instruction} = \text{non-branch}\% \times n_{avg} + \text{branch}\% \times n_{right}$$
  • General Prediction Model: $$\text{Average Cycles Per Instruction} = \text{non-branch}\% \times n_{avg} + \text{branch}\% \times (\text{inaccuracy}\% \times n_{wrong} + \text{accuracy}\% \times n_{right})$$
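
For example, with assumed values of 80% non-branch instructions, 20% branches, $n_{avg} = n_{right} = 1$ cycle, $n_{wrong} = 20$ cycles, and 95% prediction accuracy:
$$\text{Average Cycles Per Instruction} = 0.8 \times 1 + 0.2 \times (0.05 \times 20 + 0.95 \times 1) = 0.8 + 0.39 = 1.19$$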

Static Schemes

  • Backwards Taken, Forwards Not Taken (BTFNT): A prediction model that optimizes branch predictions for loops, observing that loop branches are backwards.

Dynamic Schemes

  • 1-Bit Scheme: For every branch, record whether it was taken or not.
    • Because not all branches can be stored, the lowest 6 bits of the memory address can be used to identify branches.
    • Aliasing: When different branches map to the same entry in the branch prediction table.
  • 2-Bit Scheme: For every branch, record whether it is usually taken.
    • Not Taken: $00$ or $01$.
    • Taken: $10$ or $11$.
  • Two-Level Adaptive, Global: From a branch address and a global history, an index is derived, which points to a table of 2-bit saturating counters; it's adaptive because a different history will yield a different table entry.
  • Two-Level Adaptive, Local: Similar to the two-level adaptive, global scheme, except that the CPU keeps a separate history for each branch.
  • gshare: This dynamic scheme reduces the size of other dynamic schemes by combining a branch address and its history with an XOR.
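
A minimal C sketch of a gshare-style predictor built from 2-bit saturating counters; the table size and interface are assumptions for illustration, not a real CPU's implementation.

#include <stdbool.h>
#include <stdint.h>

#define TABLE_BITS 12
#define TABLE_SIZE (1u << TABLE_BITS)

static uint8_t  counters[TABLE_SIZE]; /* 2-bit saturating counters: 0..3 */
static uint32_t history;              /* global branch history register  */

/* gshare index: XOR the branch address with the global history. */
static uint32_t index_of(uint32_t pc) {
    return (pc ^ history) & (TABLE_SIZE - 1);
}

/* Predict taken when the counter is in a "taken" state (10 or 11). */
bool predict(uint32_t pc) {
    return counters[index_of(pc)] >= 2;
}

/* Once the branch resolves, train the counter and update the history. */
void train(uint32_t pc, bool taken) {
    uint32_t i = index_of(pc);
    if (taken && counters[i] < 3) counters[i]++;
    if (!taken && counters[i] > 0) counters[i]--;
    history = (history << 1) | (taken ? 1u : 0u);
}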

Cache Coherency

Cache Coherency

  • Cache Coherency: When the values in all caches are consistent; and to some extent, the system behaves as if all CPUs are using shared memory.
  • As modern CPUs have 3 or 4 cache levels, L3 cache is often used for cache coherency communication, and L4 cache is often used for integrated graphics.

Snoopy Caches

  1. Every CPU knows whether its cached copy of data from main memory is shared or not.
  2. Thus, whenever a CPU issues a memory write, the other CPUs snoop to observe if the memory location is in their cache.
  3. If the memory location is in their cache, the other CPUs either update or invalidate their cached copy.

Write-Through Caches

Algorithm

  1. All cache writes are done to main memory.
  2. All cache writes appear on the shared bus.
  3. If another CPU snoops and sees it has the same location in its cache, it will either update or invalidate the data.

Variants

  • Write No-Allocate: When writing an invalidated location, the cache can be bypassed, and the write can go directly to memory.
  • Write Broadcast: When writing, all versions in all caches are updated.

Protocol (State Machine)

State     Observed   Generated   Next State
Valid     PrRd       ~           Valid
Valid     PrWr       BusWr       Valid
Valid     BusWr      ~           Invalid
Invalid   PrWr       BusWr       Valid
Invalid   PrRd       BusRd       Valid
  • Where Valid and Invalid are the two possible states for each cached memory location.
  • Where Pr is the processor.
  • Where Bus is the bus.
  • Where Rd is a read and Wr is a write.

Write-Back Caches

MSI Protocol

State      Observed   Generated   Next State
Modified   PrRd       ~           Modified
Modified   PrWr       ~           Modified
Modified   BusRd      BusWB       Shared
Modified   BusRdX     BusWB       Invalid
Shared     PrRd       ~           Shared
Shared     BusRd      ~           Shared
Shared     BusRdX     ~           Invalid
Shared     PrWr       BusRdX      Modified
Invalid    PrRd       BusRd       Shared
Invalid    PrWr       BusRdX      Modified
  • Where Modified is the state when only this cache has a valid copy; main memory is out-of-date.
  • Where Shared is the state when the location is unmodified, up-to-date with main memory.
  • Where Invalid is the same as before.
  • Where BusWB is a write-back and BusRdX is an exclusive read.
  • Note: The initial state for a memory location, upon its first read, is Shared.

MESI Protocol - Extension to MSI

  • Exclusive is the state when only this cache has a valid copy; main memory is up-to-date.
  • This protocol allows a processor to modify data exclusive to it, without having to communicate with the bus.

MESIF Protocol - Extension to MESI

  • Forward is another shared state; however, the cache in the Forward state is the only one that will respond to a request to transfer the data.
  • This protocol reduces bus arbitration or bus contention by preventing multiple caches from answering.

False Sharing

  • When two unrelated data elements of a program map to the same cache line, false sharing can occur: a write that invalidates one element also invalidates the line holding the other (see the sketch below).
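
A minimal sketch of the layout problem, assuming a 64-byte cache line; the struct names are illustrative.

/* Both counters share one cache line: a write to counter_a by one thread
 * invalidates the line holding counter_b in the other thread's cache. */
struct shares_line {
    long counter_a;   /* written only by thread A */
    long counter_b;   /* written only by thread B */
};

/* Padding pushes the counters onto separate cache lines (64 bytes assumed),
 * eliminating the false sharing at the cost of some memory. */
struct padded {
    long counter_a;
    char pad[64 - sizeof(long)];
    long counter_b;
};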

Volatile

  • In C, the volatile keyword provides the following features:
    1. Allow access to memory mapped devices.
    2. Allow uses of variables between setjmp and longjmp.
    3. Allow uses of sig_atomic_t variables in signal handlers (see the sketch after this list).
  • The volatile keyword does not prevent reordering of instructions.
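
A minimal sketch of use case 3: a flag of type volatile sig_atomic_t is set in a signal handler and polled in the main loop. volatile forces a real load on each iteration but, as noted above, provides no ordering guarantees.

#include <signal.h>

static volatile sig_atomic_t done = 0;

static void handle_sigint(int sig) {
    (void)sig;
    done = 1;               /* asynchronous write from the handler */
}

int main(void) {
    signal(SIGINT, handle_sigint);
    while (!done) {         /* volatile forces a fresh load each iteration */
        /* ... do work ... */
    }
    return 0;
}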

Concurrency and Parallelism

Amdahl's Law

$$T_{p} = T_{s} \times \left(S + \frac{P}{N}\right)$$
$$\text{Speed-Up} = \frac{T_{s}}{T_{p}} = \frac{1}{S + P/N} \to \frac{1}{1 - P} \text{ as } N \to \infty$$
  • Where $T_{p}$ is the parallel time.
  • Where $T_{s}$ is the serial time.
  • Where $S$ is the serial fraction of the program.
  • Where $P$ is the parallel fraction of the program.
  • Where $N$ is the number of processors.
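
For example, with an assumed parallel fraction $P = 0.8$ (so $S = 0.2$) and $N = 4$ processors:
$$\text{Speed-Up} = \frac{1}{0.2 + 0.8/4} = \frac{1}{0.4} = 2.5$$
and no matter how many processors are added, the speed-up is bounded by $\frac{1}{1 - P} = 5$.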

Assumptions

  1. The problem size is fixed.
  2. The program behaves the same on 1 processor as on $N$ processors.
  3. The runtimes can be accurately measured.

Generalized Amdahl's Law

  • Where $f_{1}, f_{2}, ..., f_{n}$ are the fractions of time in part $n$ of a program.
  • Where $S_{f_{1}}, S_{f_{2}}, ..., S_{f_{n}}$ are the speed-ups for part $n$ of a program.
$$\text{Speed-Up} = \cfrac{1}{\frac{f_{1}}{S_{f_{1}}} + \frac{f_{2}}{S_{f_{2}}} + ... + \frac{f_{n}}{S_{f_{n}}}}$$
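
For example, if an assumed 60% of a program is sped up by $2\times$ and the remaining 40% is unchanged:
$$\text{Speed-Up} = \cfrac{1}{\frac{0.6}{2} + \frac{0.4}{1}} = \frac{1}{0.7} \approx 1.43$$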

Parallelization Issues

  1. Locking and Synchronization Points $\implies$ Resource Contention.
  2. Centralized Memory Allocators.
  3. Overhead with Creating/Destroying Resources.
  4. Multiplexing Software Threads onto Hardware Threads.
    • Kernel-Level Threading: $1:1$.
    • User-Level Threading: $N:1$.
    • Hybrid Threading: $M:N$.

Processes vs. Threads

Processes

  1. Each process has its own virtual address space.
  2. Buffer overruns or other security holes do not expose other processes.
  3. If a process crashes, the others can continue.

Threads

  1. Interprocess communication is more complicated and slower than interthread communication; must use operating system utilities.
  2. Processes have much higher startup, shutdown, and synchronization costs than threads.
  3. pthreads fix the issues of clone and provide a uniform interface for most systems.

Parallelization Patterns

  1. Multiple Independent Tasks $\implies$ No Communication, Simple Scalability.
  2. Multiple Loosely-Coupled Tasks $\implies$ Moderate Communication & Moderate Scalability.
  3. Multiple Copies of the Same Task $\implies$ No Communication, Simple Scalability.
  4. Single Task, Multiple Threads (Divide-and-Conquer) $\implies$ Complex Communication, Reduced Latency ~ Improved Throughput.
  5. Pipeline of Tasks $\implies$ Moderate Communication, Constant Latency ~ Improved Throughput.
  6. Client-Server $\implies$ Moderate Communication, Improved Throughput.
  7. Producer-Consumer $\implies$ Simple Communication, Improved Throughput.

Working with Threads

See Lecture 6 - Working with Threads for pthreads.

Race Conditions & Synchronization

Race Conditions

  • A race occurs when you have two concurrent accesses to the same memory location, at least one of which is a write.

Dependencies

  1. RAW (Read After Write): The read has to take place after the write, otherwise there's nothing to read, or an incorrect value will be read.
  2. WAR (Write After Read): A write cannot take place until the read has happened, to ensure the read takes the correct value.
  3. WAW (Write After Write): A write cannot take place because an earlier write needs to happen first. If we do them out of order, the final value may be stale or incorrect.
  4. RAR (Read After Read): No hazards.
              Read 2nd                 Write 2nd
Read 1st      RAR - No Dependency      WAR - Antidependency
Write 1st     RAW - True Dependency    WAW - Output Dependency
  • WAR and WAW dependencies inhibit parallelization, but copying data and using immutable data can eliminate them; a small example of the hazards follows.
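
A small illustrative C fragment showing the three hazards on ordinary statements:

int a, b;

a = 10;       /* (1) write a                                      */
b = a + 1;    /* (2) RAW: must read a after (1) writes it         */
a = 7;        /* (3) WAR: must not write a until (2) has read it  */
a = 42;       /* (4) WAW: must follow (3) so a ends up holding 42 */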

Synchronization

  • Mutual Exclusion: A mutex allows only one thread to hold it at a time.
    • Locks protect resources because only one thread can hold a lock at a time.
    • Other threads contending for the lock are blocked until the holder releases it.
    • Critical Section: The code between lock acquisition and lock release (see the pthreads sketch after this list).
  • Spinlocks: A variant of mutexes where the waiting thread repeatedly tries to acquire the lock instead of sleeping.
    • Spinlocks consume CPU while waiting, so they are appropriate only when critical sections are expected to finish quickly.
  • Semaphores: A synchronization primitive with a counter used to signal between threads.
  • Barriers: A synchronization primitive that ensures a collection of threads have all reached the barrier before any of them proceeds.
  • Reader/Writer Locks: A synchronization primitive that allows any number of readers to be in the critical section simultaneously, but only one writer may be in the critical section exclusively.
  • Lock-Free Code: More complex synchronization abstractions built on CPU atomic operations rather than on locks.
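
A minimal pthreads sketch of mutual exclusion; the counter and worker function are illustrative.

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* enter the critical section */
        counter++;                    /* protected shared data      */
        pthread_mutex_unlock(&lock);  /* leave the critical section */
    }
    return NULL;
}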

Asynchronous I/O

See Lecture 8 - Asynchronous I/O for cURL.

Of Asgard Hel

See Lecture 9 - Of Asgard Hel for Valgrind.

Use of Locks, Reentrancy

Appropriate Locking

  • The critical section should be kept as small as possible: shrinking it improves performance because contention for the lock is expensive.

Locking Granularity

  • Coarse-Grained Locking: When you lock large sections of your program with a single big lock.
  • Fine-Grained Locking: When you lock small sections of your program with multiple small locks.

Lock Overhead

  1. The memory allocated for the lock.
  2. The time to create and to destroy the lock.
  3. The time to acquire and release the lock.

Lock Contention

  1. Making dependent locking regions more granular reduces contention.
  2. Making independent locking regions use different locks reduces contention.

Deadlock Conditions

  1. Mutual Exclusion: A resource belongs to, at most, one process at a time.
  2. Hold-and-Wait: A process that is currently holding some resources may request additional resources and may be forced to wait for them.
  3. No Preemption: A resource cannot be "taken" from the process that holds it; only the process currently holding that resource may release it.
  4. Circular-Wait: A cycle in the resource allocation graph.

Reentrancy

  • A function is thread-safe if it can be executed from more than one thread at the same time.
  • A function is reentrant if every invocation is independent of every other invocation; thus, a reentrant function is always thread-safe.
    • A reentrant function can be invoked while the function is already executing, possibly from the same thread.
    • A reentrant function can be restarted without affecting its output.
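
A standard-library illustration: strtok() keeps hidden static state between calls, so it is neither reentrant nor thread-safe, while strtok_r() makes the caller hold that state, so each invocation is independent.

#include <string.h>

void tokenize(char *line) {
    char *saveptr;                            /* caller-owned state, not a hidden static */
    for (char *tok = strtok_r(line, " ", &saveptr);
         tok != NULL;
         tok = strtok_r(NULL, " ", &saveptr)) {
        /* ... use tok ... */
    }
}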

Lock Convoys, Atomics, Lock-Freedom

Lock Convoys

  • A lock convoy occurs when 2+ threads at the same priority frequently contend for a synchronization object, even if they only hold that object for a very short amount of time; thus, the CPU spends all its time on context switches.
    • Side Effect 1: The threads that are part of the convoy run for very short periods before blocking.
    • Side Effect 2: The threads that are not part of the convoy but have the same priority run for very long periods.
  • Diagnosis: If a lock has a nonzero number of waiting threads, but none of the threads appears to own it, then there might be a lock convoy.
  • Solution: The use of unfair locks helps with lock convoys.
    1. By not giving the lock to $B$ when $A$ releases the lock, the lock becomes unowned.
    2. The scheduler chooses another thread to switch to after $A$.
      • If it's $B$, then it gets the lock and continues.
      • If it's $C$ and it does not request the lock, then it continues.
      • If it's $C$ and it does request the lock, then it gets the lock and continues.
    3. With a fair lock, $B$ would own the lock even while sitting at the back of the run queue, so no other thread such as $C$ could acquire the lock and make progress; thus, unfair locks improve throughput.

Lock Convoy Mitigations

  1. Sleep: If the threads not in the lock convoy call sleep() frequently, then the threads in the lock convoy have an increased probability to make progress.
  2. Sharing: If the program benefits from the use of reader-writer locks, then by sharing these locks, the threads in the lock convoy would contend with each other less.
  3. Caching: If the threads in the lock convoy contend such that they can access a critical section guarding shared data, caching the shared data to reduce the critical section would improve throughput.
  4. Try-Lock: A try-lock synchronization primitive continually tries to acquire the lock, and if it fails, then it yields the CPU to some other thread.
    • The use of a spin limit allows low priority threads to execute in a critical section; the threads can recover from contention without creating a convoy.

The Thundering Herd Problem

  1. When some condition is fulfilled, it can trigger a large number of threads to wake and try to take some action.
  2. Likely, not all of the threads can proceed, so some of them will block.
  3. Thus, it is better to wake one thread at a time instead of all of the threads.

The Lost Wakeup Problem

  • When calling notify() to wake a single thread instead of all threads, it is possible for the wakeup to be lost, leaving the intended thread waiting indefinitely.

Atomics

  • Atomic: A lower-overhead alternative to locks that executes its operation indivisibly; other threads see the states before or after the operation, but nothing in between.
    • Operations: Reads, Writes, Read-Modify-Write, Compare-and-Swap.
  • ABA Problem: A compare-and-swap operation can suffer false positives if it compares with a memory location with a value of $A$, but before it proceeds, the memory location changes from $A$ to $B$ to $A$.
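
A minimal C11 sketch of a lock-free increment built on compare-and-swap: the loop retries whenever another thread changed the value between the load and the swap. (A plain counter like this does not exhibit the ABA problem; ABA matters when the compared value is, for example, a pointer that can be freed and reused.)

#include <stdatomic.h>

static _Atomic long counter = 0;

void increment(void) {
    long old = atomic_load(&counter);
    /* On failure, old is reloaded with the current value and we retry. */
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1)) {
        /* another thread won the race; try again with the new value */
    }
}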

Lock-Freedom

  • A non-blocking data structure is one where none of the operations can result in being blocked.
  • A lock-free data structure is a thread-safe data structure that does not use locks.
    • If any thread performing an operation gets suspended during the operation, then other threads accessing the data structure are still able to complete their tasks.
  • A wait-free data structure is a thread-safe data structure that ensures each thread will complete its operations in a bounded number of steps regardless of what any other threads do.
  • Note: Lock-free algorithms aim to ensure that the system as a whole makes forward progress; their aim is not raw performance.

Autoparallelization

Three Address Code

  • Compilers convert ASTs into an intermediate, portable, three-address code for analysis. $$\text{result} := \text{operand}_{1} \text{ operator } \text{operand}_{2}$$
  • The gcc flags -fdump-tree-gimple and -fdump-tree-all can be used to see the three-address code.
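
A sketch of the lowering for one statement; the temporary name t1 is illustrative (gcc's GIMPLE dumps use names like D.1234 or _1).

int lower_example(int b, int c, int d) {
    int a;
    int t1;

    /* Source: a = b + c * d;            */
    t1 = c * d;      /* one operator per */
    a  = b + t1;     /* statement        */
    return a;
}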

restrict Qualifier

  • The restrict qualifier on a pointer p tells the compiler that it may assume that, in the scope of p, the program will not use any other pointer q to access the data at *p.
  • The restrict qualifier allows a compiler to optimize code, especially critical loops, better.
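
A minimal sketch: with restrict, the compiler may assume dst and src never alias, so it can keep values in registers and vectorize the loop; the function itself is illustrative.

#include <stddef.h>

void scale(float *restrict dst, const float *restrict src, size_t n, float k) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = k * src[i];   /* no aliasing assumed between dst and src */
    }
}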

Automatic Parallelization Compilers

  • icc: Intel C Compiler.
  • cc: Solaris Studio Compiler.
  • gcc: GNU C Compiler - Graphite.
  • clang: Clang Compiler - polly.
  • Note: Most compilers' parallelization frameworks use OpenMP.

Automatic Parallelization Loops

Single Loop

for (i = 0; i < 1000; i++)
  x[i] = i + 3;

Nested Loops with Simple Dependency

for (i = 0; i < 100; i++)
  for (j = 0; j < 100; j++)
    x[i][j] = x[i][j] + y[i - 1][j];

Single Loop with Not-Very-Simple Dependency

for (i = 0; i < 10; i++)
  x[2 * i + 1] = x[2 * i];

Single Loop with If Statement

for (j = 0; j <= 10; j++)
  if (j > 5) x[i] = i + 3;

Triangle Loop

for (i = 0; i < 100; i++)
  for (j = i; j < 100; j++)
    x[i][j] = 5;

Automatic Parallelization Tips

  1. Have a recognized loop style, e.g. for-loops with constant bounds.
  2. Have no dependencies between data accessed in loop bodies for each iteration.
  3. Not conditionally change scalar variables read after the loop terminates, or change any scalar variable across iterations.
  4. Have enough work in the loop body to make parallelization profitable.
  5. Inline pure functions or convert functions to macros.

OpenMP

See Lecture 13 - OpenMP for OpenMP.

OpenMP Tasks

See Lecture 14 - OpenMP Tasks for OpenMP Tasks.

Memory Consistency

OpenMP Memory Model

  • Relaxed-Consistency, Shared-Memory Model:
    • All threads share a single store called memory, which does not necessarily correspond to actual RAM.
    • Each thread can have its own temporary view of memory.
    • Each thread's temporary view of memory is not required to be consistent with memory.

Flush Directive

  • As updates from one thread may not be seen by the other, OpenMP's flush directive makes a thread's temporary view of memory consistent with the main memory, by enforcing an order on the memory operations of the variables.
    #pragma omp flush [(list)]
    • Where the flush-set is the variables in the list.
    • A flush directive can be reordered by the compiler, so all relevant variables must be provided in the flush-set.
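
A minimal sketch of a hand-rolled producer/consumer handshake using flush; the variable names are illustrative, and both relevant variables are flushed together per the note above.

void handshake(void) {
    int data = 0, ready = 0;
    #pragma omp parallel sections shared(data, ready)
    {
        #pragma omp section
        {                                      /* producer */
            data = 42;
            #pragma omp flush(data, ready)     /* data must be visible before the flag */
            ready = 1;
            #pragma omp flush(ready)
        }
        #pragma omp section
        {                                      /* consumer */
            int seen = 0;
            while (!seen) {
                #pragma omp flush(ready)       /* re-read the flag from memory */
                seen = ready;
            }
            #pragma omp flush(data, ready)     /* now data is guaranteed visible */
            /* ... use data ... */
        }
    }
}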

Ordering Rules

  1. All read/write operations on the flush-set which happen before the flush complete before the flush executes.
  2. All read/write operations on the flush-set which happen after the flush complete after the flush executes.
  3. Flushes with overlapping flush-sets cannot be reordered.

Implicit Flush - Yes

See Lecture 15 - Memory Consistency.

Implicit Flush - No

See Lecture 15 - Memory Consistency.

Common Performance Issues in OpenMP

  1. Unnecessary Flushing.
  2. Using Critical Sections/Locks vs. Atomics.
  3. Unnecessary Concurrent-Memory-Writing Protection.
    • Local Thread Variables.
    • Single/Master Access.
  4. Too Much Work in Critical Sections.
  5. Too Many Entries into Critical Sections.

Memory Consistency Models

  • Reordering: A compiler or a processor may reorder non-interfering memory operations within a thread to speed-up code.
  • Sequential Consistency: No Reordering of Loads/Stores.
  • Sequential Consistency for Datarace-Free Programs: No Data Races.
  • Relaxed Consistency: Loads Reordered After Loads/Stores + Stores Reordered After Loads/Stores.
  • Weak Consistency: Any Reordering is Possible.

Memory Barriers

  • A memory barrier or fence prevents reordering or ensures that memory operations become visible in the right order.
  • A memory barrier ensures that no access occurring after the barrier becomes visible to the system until all accesses before the barrier have become visible.

x86 Memory Barriers

  • mfence: All loads and stores before the barrier become visible before any loads and stores after the barrier become visible.
  • sfence: All stores before the barrier become visible before all stores after the barrier become visible.
  • lfence: All loads before the barrier become visible before all loads after the barrier become visible.
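
A minimal C11 sketch of the publish pattern: the full fence keeps the data store from being reordered after the flag store, so a reader that sees flag == 1 (after its own fence) also sees data == 42. On x86, a sequentially consistent fence is typically emitted as mfence.

#include <stdatomic.h>

int data;
atomic_int flag;

void publish(void) {
    data = 42;
    atomic_thread_fence(memory_order_seq_cst);   /* full barrier (mfence on x86) */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

int try_consume(void) {
    if (atomic_load_explicit(&flag, memory_order_relaxed) == 1) {
        atomic_thread_fence(memory_order_seq_cst);
        return data;                             /* guaranteed to observe 42     */
    }
    return -1;                                   /* not published yet            */
}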

C/C++ 11 Memory Model

  • Before C11/C++11: no multi-threaded abstract machine and no standard concurrency primitives.
  • Since C11/C++11: a multi-threaded abstract machine with standard concurrency primitives (threads, atomics, memory orderings).

Dependencies and Speculation

Dependencies

  • A dependency prevents parallelization when the computation $XY$ produces a different result from the computation $YX$.
    • Loop-Carried Dependency: An iteration depends on the result of the previous iteration.
    • Memory-Carried Dependency: The result of a computation depends on the order in which two memory accesses occur.

Critical Paths

  • A critical path is the minimum amount of time to complete a task, taking dependencies into account.

Speculative Execution

  • A thread can be launched to compute a result that may or may not be needed.
    • No-Speculation: $T = T_{1} + p T_{2}$
    • Yes-Speculation: $T = \max(T_{1}, T_{2}) + S$
    • Where $S$ is the synchronization overhead.
    • Where $p$ is the probability of executing $T_{2}$.
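
For example, with assumed values $T_{1} = 10$, $T_{2} = 6$, $S = 1$, and $p = 0.5$:
$$\text{No-Speculation: } T = 10 + 0.5 \times 6 = 13 \qquad \text{Yes-Speculation: } T = \max(10, 6) + 1 = 11$$
so speculating is worthwhile for these numbers.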

Value Speculation

  • If there is a true dependency between the result of a computation and its successor and the result is predictable, speculatively execute the successor based on the predicted result.
    • No-Speculation: $T = T_{1} + T_{2}$
    • Yes-Speculation: $T = \max(T_{1}, T_{2}) + S + p T_{2}$
    • Where $S$ is the synchronization overhead.
    • Where $p$ is the probability of the predicted result being incorrect.

When Can We Speculate?

  • Computation 1 and Computation 2 Must Not Call Each Other.
  • Computation 2 Cannot Depend on Modified Values from Computation 1.
  • Computation 1 Must Be Deterministic.

Software Transactional Memory

  • Software Transactional Memory: All the code within an atomic block executes completely, or aborts/rolls back in the event of a conflict with another transaction.

Advantages

  • Simple.

Disadvantages

  • Impossible to Rollback I/O.
  • Difficulty with Nested Transactions.
  • Transaction Size Limits.

Post-Midterm Content

See Lectures 17 to 36 for Post-Midterm Content; School Closure b/c Pandemic.