Concurrency
- Memory Models
- Data Structures
- Misc
See also notes on:
- Computer Architecture
- Operating Systems
Memory Models
What's the point? There is a mismatch at two levels:
- High level languages (C/Java/Ocaml) and assembly
- Assembly and CPU internals
This mismatch is good. The mismatch is abstraction. It gives simpler programming models to users. It allows certain optimizations to happen that the programmer wouldn’t have thought of. The programmer doesn’t necessarily want to say exactly which transistor is flipped at exactly what moment. There is a higher level goal of solving a matrix problem, or writing a record to memory, or swapping two variables.
In non-threaded code we actually don't care if the following two instructions are reordered:

store 1 in a
store 42 in b

And if there is some reason why it would be faster to do so, sure, go for it, compiler. So the program order is not really the execution order insisted upon by single-threaded code. It doesn't, however, allow rearrangement of

print("hello")
print("world")
Perhaps however, we are relying on this order to implement some shared primitive.
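For instance, here is a sketch (my own, with made-up names) of a hand-rolled publish primitive that silently relies on store order. Nothing guarantees that order, and the spin loop is a data race on top of it:

```cpp
#include <cstdio>
#include <thread>

// Intentionally broken "publish" idiom: it relies on the two stores in
// producer() staying in source order, which nothing here guarantees.
int data = 0;
bool ready = false;  // plain variables: compiler and CPU may reorder accesses

void producer() {
    data = 42;
    ready = true;  // may become visible before the store to data
}

void consumer() {
    while (!ready) {}  // data race; the compiler may even hoist this load
    std::printf("%d\n", data);  // can legally print 0 (or never get here)
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```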
When you compile, there are intermediate states of the target that don't really correspond to anything in the source. Depending on the conceptual distance between source and target, these in-between stretches may be short. If you point at some particular section of assembly, you can probably say: ah yes, it is in the middle of trying to achieve this statement in the source. "In the middle" is not a very precise statement though.
Some things that seem intuitively atomic at the source are not in the target. One statement may become multiple instructions, which a concurrent processor could screw with.
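`counter++` is the textbook case: one source statement, roughly three target operations (load, add, store), and a racing thread can land between any of them. A minimal sketch:

```cpp
#include <atomic>

int counter = 0;                     // plain int: increment is not atomic
std::atomic<int> atomic_counter{0};  // atomic int: increment is indivisible

void bump_racy() {
    counter++;  // compiles to roughly: load counter; add 1; store counter.
                // Two concurrent bumps can both load the same old value,
                // and the count only goes up by one.
}

void bump_atomic() {
    atomic_counter.fetch_add(1);  // a single indivisible read-modify-write
}
```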
Example programs
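The classic litmus tests are the place to start. Store buffering (SB) is the one that already separates sequential consistency from x86/TSO: both threads can read 0, an outcome no interleaving of the source statements explains. A sketch using relaxed atomics (atomics so the program is data-race-free while still permitting the weak outcome):

```cpp
#include <atomic>

// Store buffering (SB) litmus test.
std::atomic<int> x{0}, y{0};
int r1, r2;

void thread0() {
    x.store(1, std::memory_order_relaxed);
    r1 = y.load(std::memory_order_relaxed);
}

void thread1() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

// Under sequential consistency at least one load must see the other thread's
// store, so r1 == 0 && r2 == 0 is impossible. With store buffers (x86/TSO)
// or relaxed atomics, that outcome is allowed and observable.
```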
Axiomatic Semantics
- Lecture Notes on Weak Memory Models – Part I
- Explaining Relaxed Memory Models with Program Transformations

Ori Lahav and Viktor Vafeiadis are two good names to know. Rewriting (program transformation) is one lens here; more under Rewriting below.
What is an execution?
Execution graphs. A pile of dependency relations.
Out of a universe of possible events, some subset actually happens. Some events are reads, some are writes.
Reads read a value that was written somewhere.
Data race = two unordered events on the same location, at least one of which is a write (i.e., not a double read).
An overapproximate model is one in which all reads return chaotic, arbitrary values. Maybe this even makes sense if stuff gets lost in transit.
A more reasonable model is that every value read must have been written by a different event somewhere. For bool variables, you can probably find any possible value written somewhere, so this actually isn't even that much of a restricting assumption.
Operational Semantics
- A Promising Semantics for Relaxed-Memory Concurrency
- Promising 2.0: Global Optimizations in Relaxed Memory Concurrency
What is the notion of "state"? Keeping history; possibly you may read from history. Traces translate to execution graphs.
Timestamps are a useful idea. Timestamps may not be totally ordered; they might be vectors of local times, for example.
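A sketch of the vector-of-local-times idea (names mine): comparison is component-wise, so two timestamps can be simply incomparable, i.e. concurrent:

```cpp
#include <cstddef>
#include <vector>

// Vector timestamps, one slot of local time per thread/process.
// a "happened before" b when a <= b component-wise (and they differ).
bool leq(const std::vector<int>& a, const std::vector<int>& b) {
    for (std::size_t i = 0; i < a.size(); ++i)  // assumes equal lengths
        if (a[i] > b[i]) return false;
    return true;
}

// Neither precedes the other: the events are concurrent/unordered.
bool concurrent(const std::vector<int>& a, const std::vector<int>& b) {
    return !leq(a, b) && !leq(b, a);
}
```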
Rewriting
What rewrites of the program are "allowed"? AKA, what compiler optimizations are allowed.
Memory Order vs Program Order
Program order is the code order in the source. This vaguely evokes the image of a linear sequence of statements. That is not really what high level languages look like: they have compound statements and expressions, and control flow. The syntax is commonly thought of as tree-like. Is a statement in a loop "before" another? Statements are executed multiple times, so sure, an action during one execution of the loop comes before another, but a syntactic statement in the loop corresponds to a whole family of actions across all executions of the loop.
Even in single-threaded code, execution does not need to follow program order. Non-interfering stores and loads are allowed to slide over each other. Common subexpression elimination means computations may not even happen when the program says they do. Function calls and system calls are probably required to stay in the same order as the source.
We could describe some relatively concrete concurrent machine, with some layer of caching. Does the variable "a" in the code exist in the execution of this machine? We are already used to "a" depending on time. But now it also depends on location. It is conceivable that Thread 1, Thread 2, and Cache 1 have different opinions on what "a" is at the "same" time. Their clocks stay synchronized? Quite an assumption.
The “git” model
When you write to a variable, the new value is not published everywhere. There is a separate event where this value may be published to another thread.
Not all variables may be published to the other thread at once.
It isn't per se required in a general memory model that

store(a,1); store(a,2)

has to publish the store of 1 before the store of 2, or that the store of 1 can't be seen even after you've seen the store of 2. Those are extra requirements of in-order arrival and deduplication of publish messages.
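To make the "git" picture concrete, a toy sketch (entirely made up, not from any paper): each thread has a private view of memory, and a store only reaches the other thread when a separate publish event copies it over. Out-of-order and duplicated publishes are exactly the weak behaviors above:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy "git" memory model with two threads.
struct ToyMemory {
    std::map<std::string, int> view[2];                 // each thread's private view
    std::vector<std::pair<std::string, int>> inflight;  // thread 0's unpublished stores

    void store(int tid, const std::string& var, int val) {
        view[tid][var] = val;                          // visible locally right away
        if (tid == 0) inflight.push_back({var, val});  // queued for publication
    }

    // Publish the i-th pending store to thread 1's view. Nothing forces
    // publishing in store order, exactly once, or at all: that is the weakness.
    void publish(std::size_t i) {
        view[1][inflight[i].first] = inflight[i].second;
    }
};
```

So after store(0,"a",1); store(0,"a",2); thread 1 may see 2 and then later 1, unless publishes arrive in order and get deduplicated.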
Automated Reasoning
MemSAT: Checking Axiomatic Specifications of Memory Models
https://github.com/rems-project/isla “isla is a symbolic execution engine for Sail, and a tool (sometimes known more specifically as isla-axiomatic) that uses that to evaluate the relaxed-memory behavior of instruction set architectures (ISAs) specified in Sail”
http://diy.inria.fr/doc/index.html herd7
ASP seems like a reasonable language to work in to me. Well, of course it does
```
% program order po
% transitive
po(A,C) :- po(A,B), po(B,C).

% actions
% write(A, L): action A writes to location L
% read(A, L): action A reads from location L

% rf reads-from: rf(A,B) means read B reads the value written by write A
:- rf(A,B), rf(A1,B), A != A1. % rf is functional: each read reads from exactly one write
% we obey the program order: a read can't read from a program-order-later write
:- po(B,A), rf(A,B).

% memory order mo(A,B): order on write events to the same location; irreflexive
:- mo(A,A).
```
Consistency vs Coherence
Sequential Consistency (SC)
Total Store Order (TSO)
Relaxed Memory Models
Fence Instructions
The fact that assembly instructions are a linear stream is a bit silly. This is not a great match for how the instructions are executed: they are executed out of order and pipelined. The processor can see local data dependencies and creates stalls for those. But nonlocal dependencies through simultaneous memory access can't be automatically seen. So fences let you demarcate regions of the linear stream that can't be moved past each other.
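In C++ this shows up as std::atomic_thread_fence. A sketch of message passing through explicit release/acquire fences:

```cpp
#include <atomic>

int payload = 0;  // plain data, protected by the flag + fences
std::atomic<bool> flag{false};

void sender() {
    payload = 42;
    std::atomic_thread_fence(std::memory_order_release);  // stores above can't sink below
    flag.store(true, std::memory_order_relaxed);
}

void receiver() {
    while (!flag.load(std::memory_order_relaxed)) {}      // spin until published
    std::atomic_thread_fence(std::memory_order_acquire);  // loads below can't rise above
    int v = payload;  // guaranteed to read 42
    (void)v;
}
```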
VLIW processors, for example, have schemes for expressing streams of concurrent operations.
x86
https://stackoverflow.com/questions/62465382/how-do-modern-intel-x86-cpus-implement-the-total-order-over-stores/62480523#62480523
C concurrency model on x86 for dummies https://news.ycombinator.com/item?id=32520365
Arm
https://developer.arm.com/documentation/dui0489/c/arm-and-thumb-instructions/miscellaneous-instructions/dmb--dsb--and-isb https://stackoverflow.com/questions/15491751/real-life-use-cases-of-barriers-dsb-dmb-isb-in-arm
- dmb: data memory barrier. Memory accesses can't be reordered across it (the option picks which kinds, e.g. DMB LD orders loads).
- dsb: data synchronization barrier. Stronger than dmb: waits for outstanding accesses to complete before continuing.
- isb: instruction synchronization barrier. Flushes the pipeline so subsequent instructions are re-fetched.
Language Memory Models
C++
Java
Ocaml
- A Memory Model for Multicore OCaml
- Bounding Data Races in Space and Time

You can see in the references also Go, Swift, Rust, and wasm work. The arm64 multicore support has interesting discussions.
Data Structures
Lock Mutex Semaphore
Double-Checked Locking
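The pre-C++11 versions of this pattern were famously broken. A sketch of the version that is actually correct with atomics (Widget is a stand-in name):

```cpp
#include <atomic>
#include <mutex>

struct Widget { /* expensive to construct */ };

std::atomic<Widget*> instance{nullptr};
std::mutex init_mutex;

Widget* get_instance() {
    Widget* p = instance.load(std::memory_order_acquire);
    if (p == nullptr) {  // first check: skip the lock on the fast path
        std::lock_guard<std::mutex> lock(init_mutex);
        p = instance.load(std::memory_order_relaxed);  // second check, under the lock
        if (p == nullptr) {
            p = new Widget();
            instance.store(p, std::memory_order_release);  // publish fully-built object
        }
    }
    return p;
}
```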
Misc
Arguably these are garbage collection topics?
RCU
Read Copy Update https://en.wikipedia.org/wiki/Read-copy-update
Kind of a functional approach. Make new nodes, keep old ones around.
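A minimal sketch of the flavor (single updater assumed, and ignoring the actual hard part: knowing when old nodes can be freed):

```cpp
#include <atomic>

struct Node { int value; };

std::atomic<Node*> root{new Node{0}};

// Readers just load the pointer: no locks, never a torn update.
int read() { return root.load(std::memory_order_acquire)->value; }

// Updater (assume only one): build a fresh node, publish with one pointer swap.
void update(int v) {
    Node* old_node = root.load(std::memory_order_relaxed);
    Node* new_node = new Node{v};
    root.store(new_node, std::memory_order_release);
    // Can't delete old_node yet: some reader may still hold it. Deferring this
    // reclamation (grace periods, epochs, hazard pointers) is the whole game.
    (void)old_node;
}
```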
Epoch
Very similar/identical to RCU? Maintain a global epoch counter. Suited to read-mostly data, with less per-read overhead than reference counting.
Hazard Pointers
https://en.wikipedia.org/wiki/Hazard_pointer
Lock free
https://en.wikipedia.org/wiki/Non-blocking_algorithm
https://www.youtube.com/watch?v=RWCadBJ6wTk&ab_channel=NDCConferences Cliff Click, a lock-free hash table
Compare-and-swap operations, typically used in a CAS loop.
- lock-free: at least one thread is making progress. Consider schedulers. Consider threads getting killed.
- wait-free: all threads are making progress.
- obstruction-free: a thread makes progress if it runs in isolation.
ABA problem. Lock-free linked list / stack. Pointers with counters so that you can detect reuse of freed memory. A sketch of the stack follows below.
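A sketch of the classic lock-free (Treiber) stack with CAS loops. The naive pop below is exactly where ABA bites if popped nodes are freed and their addresses reused, hence the counters/hazard pointers:

```cpp
#include <atomic>

struct Node { int value; Node* next; };
std::atomic<Node*> head{nullptr};

void push(int v) {
    Node* n = new Node{v, head.load(std::memory_order_relaxed)};
    // CAS loop: retry until head is swung from n->next to n.
    // On failure, compare_exchange updates n->next to the current head.
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {}
}

bool pop(int& out) {
    Node* n = head.load(std::memory_order_acquire);
    while (n && !head.compare_exchange_weak(n, n->next,
                                            std::memory_order_acquire,
                                            std::memory_order_acquire)) {}
    if (!n) return false;
    out = n->value;
    // Freeing n here is unsafe without hazard pointers/epochs: another thread
    // may be about to read n->next (use-after-free), and address reuse is ABA.
    return true;
}
```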
Barriers
Atomics are a way to mark thread-shared variables, in a sense. Many instances of variables in high level code are a delusion: assignment to them is not supposed to be a real observable effect. It's as if other threads had a "debugger" view into your code.
Maybe a better starting point (at least as valid as assuming the code runs literally) is that every statement can be reordered. Okay, that isn't quite true either: there are certain restrictions.
a = 2; x = 3; a = 3; - these assignments aren't real. Loop reordering too.
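For example (a sketch; exact behavior is up to the compiler, but it is allowed to do this):

```cpp
#include <atomic>

int a;
std::atomic<int> b;

void plain() {
    a = 2;  // dead store: the compiler may delete it outright.
    a = 3;  // no thread is entitled to ever observe a == 2.
}

void marked() {
    b.store(2);  // atomic stores are real, observable effects:
    b.store(3);  // another thread may legitimately observe b == 2 in between.
}
```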
Data Race Freedom
- Memory Barriers: a Hardware View for Software Hackers
- Memory access ordering: an introduction
- strong vs weak memory models: programming languages need to be as weak as the weakest hardware they target, or else they can't generate optimal code. Interesting.
C assignment is not really imperative assignment. The resulting code is not required to have an instruction that corresponds to it; it is more like a functional binding. So relying on it for some concurrent construct (and concurrency is stateful/imperative?) is nonsensical.
Caches. One model of the problem is to imagine your program runs literally, but then randomly flushes a store cache out to main memory.
Binary Translators for Weak Memory Model Architectures
Data Race Freedom
- rust atomics and memory ordering
- twitter question about atomics
- guide to preshing’s blog
- herb sutter atomic weapons
- Shared memory concurrency - a tutorial
A Promising Semantics for Relaxed-Memory Concurrency: map variables to histories. Each thread can pick arbitrary values out of the histories, subject to timestamp constraints.
- thin-air problem - Sewell, Batty
- Sewell Weak Memory page
- multicore semantics - making sense of weak memory
- The Problem of Programming Language Concurrency Semantics
- https://www.youtube.com/watch?v=N07tM7xWF1U&ab_channel=CppCon abusing your memory model for fun and profit
- https://news.ycombinator.com/item?id=29109156 what memory model should rust use
People
- Hans Boehm
- Peter Sewell
- diy, a design and testing tool for weak memory models: litmus7, herd7
- John Regehr thread https://twitter.com/johnregehr/status/1451355617583460355?s=20 Really good
- Is Parallel Programming Hard, And, If So, What Can You Do About It?
- deadlock empire. Gives a false sense of sequential consistency?
- list of kernel data races
- Herlihy is discussed as a given? https://www.amazon.com/Art-Multiprocessor-Programming-Revised-Reprint/dp/0123973376
- https://www.amazon.com/-/en/dp/0470093552 Concurrency: State Models and Java Programs
- https://www.manning.com/books/c-plus-plus-concurrency-in-action-second-edition
- https://leanpub.com/concurrencywithmodernc
- little book of semaphores
- kernel memory corruption bug from a data race, Project Zero
- Concurrent Programming (Andrews) – uses pseudo-code with synchronization/concurrency notation.
- Transactional Memory (Harris) – for non-lock based concurrency.
- LWN an introduction to lockless algorithms
- “Shared-Memory Synchronization” by Michael Scott
- https://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/
- http://plv.mpi-sws.org/gps/paper.pdf GPS: Navigating Weak Memory with Ghosts, Protocols, and Separation Weak memory models?
- https://pvk.ca/Blog/2019/01/09/preemption-is-gc-for-memory-reordering/
- https://pvk.ca/Blog/2020/07/07/flatter-wait-free-hazard-pointers/
- https://www.cs.rochester.edu/u/scott/papers/1991_TOCS_synch.pdf Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. The same Michael Scott. BBN Butterfly?
- https://arxiv.org/abs/1106.5730 HOGWILD! - parallelizing lock-free SGD
- A Primer on Memory Consistency and Cache Coherence
- Cooperating Sequential Processes by Dijkstra
- http://www0.cs.ucl.ac.uk/staff/p.ohearn/papers/concurrency.pdf O'Hearn, Resources, Concurrency and Local Reasoning
- http://pascal.hansotten.com/uploads/pbh/Monitors%20and%20Concurrent%20Pascal.pdf Monitors and Concurrent Pascal history - Per Brinch Hansen. https://twitter.com/PeterOHearn12/status/1452240719725346827?s=20 "atomic 'release and sleep' is the key as far as I'm concerned, it's so easy to invent bad solutions without that primitive"
- https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.227.3871&rep=rep1&type=pdf the 12 commandments of synchronization
Concurrency: mutex, lock, barriers, concurrent hash table https://en.wikipedia.org/wiki/Concurrent_hash_table
- https://www.youtube.com/watch?v=80ifzK3b8QQ&list=PL1835A90FC78FF8BE&index=1&ab_channel=BartoszMilewski - Bartosz Milewski on C++11 concurrency
pthread vs std::thread. Arc in Rust. Condition variables.
- https://www.coursera.org/learn/concurrent-programming-in-java
C++/Java/Ocaml memory models
- https://github.com/herd/CoqCat - interesting thesis too
- https://github.com/herd/herdtools7
- https://github.com/ocamllabs/ocaml-memory-model
- https://en.wikipedia.org/wiki/Memory_model_(programming)
- https://www.hboehm.info/c++mm/ hans boehm link farm - “impossible as a library”
- http://www.cl.cam.ac.uk/~pes20/cpp/popl085ap-sewell.pdf - Mathematizing the C++ memory model; Kodkod, Isabelle
Various kinds of relations.
futex - fast userspace mutex
MPI, Cilk, OpenMP
rust model checker actor system thing https://www.stateright.rs/
Leslie Lamport
- Lindsey Kuper has a dist sys course. It's probably pretty good. https://www.youtube.com/playlist?list=PLNPUF5QyWU8O0Wd8QDh9KaM1ggsxspJ31
Lecture 1: several nodes connected by a network, characterized by partial failure. Cloud vs HPC: cloud treats failure as expected and works around it. "Everything fails all the time" - Werner Vogels. In HPC, a partial failure is a total failure; use checkpointing.
M1 sends a request to M2 (M1 -> M2: x?). Possible problems:
- lost request
- request from M1 is slow
- M2 crashes
- M2 slow to respond
- response is slow
- response could get lost
- corrupt - byzantine faults.
Important: if you send a request to another node and don't receive a response, it is impossible for you to know why.
How do real systems cope? Timeout: wait a certain amount of time, then assume failure. An imperfect solution.
With a max network delay d and a max processing time r, you could wait 2d + r and then safely conclude failure. But you can't really do this in an asynchronous network: there is no max. Palvaro: partial failure + unbounded latency.
- https://people.ucsc.edu/~palvaro/ - Dedalus: Datalog in time and space. https://www.youtube.com/watch?v=R2Aa4PivG0g&ab_channel=StrangeLoopConference Bloom, Blazes, data lineage. Multiple proofs are fault tolerant. Lineage: why did something happen - produce an explanation / proof. CEGIS? Concolic, kind of? Observability.
- https://martin.kleppmann.com/ https://www.youtube.com/playlist?list=PLeKd45zvjcDFUEv_ohr_HdUFe97RItdiB
- https://github.com/aphyr/distsys-class
- https://www.distributedsystemscourse.com/
- Kyle Kingsbury
- chaos engineering - Nora Jones, Netflix
- http://book.mixu.net/distsys/
- https://www.youtube.com/watch?v=cQP8WApzIQQ&list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB&ab_channel=MIT6.824%3ADistributedSystems
Lecture 2 - Kuper
Clocks: scheduling, intervals in time. Time-of-day clocks vs monotonic clocks. NTP (Network Time Protocol) clocks are not so good for measuring intervals: they can jump forward or backward (cloudflare leap second bug). Monotonic clocks only go forward.
In Python: import time; time.monotonic(). Not comparable between machines.
Logical clocks measure only the order of events: A can cause B, not vice versa. Lamport diagrams / spacetime diagrams.
Network models: a synchronous network is one where there exists an n such that no message takes longer than n units. An asynchronous network is one where there is no such n. "Partially synchronous": n exists, but you don't know what it is.
Happens-before: a before b on the same process; if a is a send and b is the corresponding receive, then a is before b; transitivity. Concurrent = not ordered.
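A minimal Lamport clock sketch (scalar logical time; consistent with happens-before, but it can't detect concurrency the way vector clocks can):

```cpp
#include <algorithm>

// Lamport logical clock for one process.
struct LamportClock {
    long t = 0;
    long local_event() { return ++t; }  // internal step: just tick
    long send() { return ++t; }         // tick, attach result to the message
    long receive(long msg_t) {          // merge the sender's timestamp
        t = std::max(t, msg_t) + 1;
        return t;
    }
};

// Guarantee: if a happens-before b, then ts(a) < ts(b).
// The converse fails: ts(a) < ts(b) does not imply a happened before b.
```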
state -
Mate Soos talking about adding MPI to Z3; Z3 had MPI support but it was disabled. Armin Biere talking about pthreads - Plingeling uses pthreads. https://twitter.com/SoosMate/status/1375935222907305988?s=20 https://www.researchgate.net/publication/221403132_A_Concurrent_Portfolio_Approach_to_SMT_Solving