See also notes on:

  • Computer Architecture
  • Operating Systems

Memory Models

Weak Memory Models 101

What’s the point? There is a mismatch between both

  1. High level languages (C/Java/Ocaml) and assembly
  2. Assembly and CPU internals

This mismatch is good. The mismatch is abstraction. It gives simpler programming models to users. It allows certain optimizations to happen that the programmer wouldn’t have thought of. The programmer doesn’t necessarily want to say exactly which transistor is flipped at exactly what moment. There is a higher level goal of solving a matrix problem, or writing a record to memory, or swapping two variables.

In a non threaded code we actually don’t care if the follwing two instruction are reordered. store 1 in a store 42 in b

And if there is some reason why it would be faster to do so, sure, go for it compiler. So the program order is not really the execution order insisted upon by single threaded code. It doesn’t however alllow rearrangement of

print(“hello”) print(“world”)

Perhaps however, we are relying on this order to implement some shared primitive.

When you compile, there are intermediate states of the target that don’t really correspond to anything. Depending on the conceptual distance between the source and target, these states may be short. If you point at some particular section of assembly, you probably say, ah yes, it is in the middle of trying to achieve this statement in the source. “In the middle” is not a very precise statement though.

Some things that seem intuitively atomic at the source are not in the target. One statement may become multiple, that a concurrent processor could screw with.

Example programs

Axiomatic Semantics

Lecture Notes on Weak Memory Models – Part I Explaining Relaxed Memory Models with Program Transformations Ori Lahav and Viktor Vafeiadis are two good names to know

rewriting

What is an execution?

Execution graphs. A pile of dependency relations.

Some events happen out of possible events. Some events are reads, some are writes.

Reads read a value that was written somewhere.

Data race = two events, at least one of which is a write (aka not double read).

An overapproximate model is one in which all reads are chaosed. Maybe this even makes sense if stuff gets lost in transit.

A more reasonable thing is that every read must have been written in a different event somewhere. For bool variables, you can probably find any possibly value somewhere, so this actually isn’t even that much of a restricting assumption.

Operational Semantics

A Promising Semantics for Relaxed-Memory Concurrency Promising 2.0: Global Optimizations in Relaxed Memory Concurrency

What is the notion of “state” Keeping history. Possibly you may read from history Traces translate to execution graphs

Timestamps is a useful idea. Timestamps may not be totally ordered. They might be vectors of local times for example

Rewriting

What rewrites of the program are “allowed”. AKA, what compiler optimizations are allowed.

Memory Order vs Program Order

Memory order Program

Program order is the code order in the source. This is vaguely evokes the image of a linear sequence of statements. This is not really what high level languages look like. They have compound statements and expressions, and control flow. The syntax commonly is thought to be tree-like. Is a statement in a loop “before” another? They are executed multiple times, so sure the action during the execution of the loop comes before another, but syntactic statements in the loop correspond to a whole family of actions across all executions of the loop.

Even in single threaded code, execution does not need to follow program order. Non interfering stores and loads are allowed to be slid over each other. Common subexpression elimination means computations may not even happen when they do in the program. Function calls and system calls are probably required to stay in the same order as the source.

We could describe some relatively concrete concurrent machine, with some layer of caching. Does the variable “a” in code exist in the execution of this machine. We are already used to “a” depending on time. But it also not depends on location. It is conceivable that Thread 1, Thread 2, Cache 1 have different opinions on what “a” at the “same” time. Their clocks stay synchronized? Quite an assumption.

The “git” model

When you write to a variable, the new value is not published everywhere. There is a separate event where this value may be published to another thread. Not all variables may be published to the other thread at once. It isn’t persay required in a general memory model that store(a,1) ;store(a,2) has to necessarily publish the store of 1 before the store of 2, or that the store of 1 can’t be seen even after you’ve seen the store of 2. These are certain requirements of in order arrival and dedeuplication of publish messages.

Automated Reasoning

MemSAT: Checking Axiomatic Specifications of Memory Models

https://github.com/rems-project/isla “isla is a symbolic execution engine for Sail, and a tool (sometimes known more specifically as isla-axiomatic) that uses that to evaluate the relaxed-memory behavior of instruction set architectures (ISAs) specified in Sail”

http://diy.inria.fr/doc/index.html herd7

ASP seems like a reasonable language to work in to me. Well, of course it does


% program order po

% transitive
po(A,C) :- po(A,B), po(B,C).

% actions

% write(A, L). action A writes to locatin L
% read(A,L) action A reads from location L


% rf reads from
:- rf(A,B), rf(A,B1), B != B1. % rf is a functional relationship

% we obey the program order
:- po(B,A), rf(A,B).

% memory order. Order on write events to same location. mo(A,A)


Consistency vs Coherence

Sequential Consistency (SC)

Total Store Order (TSO)

Relaxed Memory Models

Fence Instructions

The fact that assembly instructions are a linear stream is a bit silly. This is not a great match for how the unstructions are executed. They are executed out of order and pipelined. The processor can see local data dependncies and creates stalls for those. But nonlocal dependencies on simulatnesous memory access can’t be automatically seen? So you can sort of demarcate regions of the linear stream tht can’t be moved past each other.

VLIW processors for example can schemes for having streams of concurrent operations.

x86

https://stackoverflow.com/questions/20316124/does-it-make-any-sense-to-use-the-lfence-instruction-on-x86-x86-64-processors

https://stackoverflow.com/questions/62465382/how-do-modern-intel-x86-cpus-implement-the-total-order-over-stores/62480523#62480523

C concurrency model on x86 for dummies https://news.ycombinator.com/item?id=32520365

Arm

https://developer.arm.com/documentation/dui0489/c/arm-and-thumb-instructions/miscellaneous-instructions/dmb--dsb--and-isb https://stackoverflow.com/questions/15491751/real-life-use-cases-of-barriers-dsb-dmb-isb-in-arm

Misc

Arguably these are garbage collection topics?

RCU

Read Copy Update https://en.wikipedia.org/wiki/Read-copy-update

Kind of a functional approach. Make new nodes, keeep old ones around.

Epoch

Very similar/identical to RCU? Maintain global epoch counter. Data is more read only than reference counting

Hazard Pointers

https://en.wikipedia.org/wiki/Hazard_pointer

Lock free

https://en.wikipedia.org/wiki/Non-blocking_algorithm https://www.youtube.com/watch?v=RWCadBJ6wTk&ab_channel=NDCConferences cliff click a lock free hash table Compare and swap operations. lock -free - at least one thread is making progress. Consider schedulaers. Consider threads getting killed CAS loop wait free - all threadsmaking progress obstruction free ABA problem Lock-free linked list / stack. Pointers with counters so that you can detect reuse of freed memory

Barriers

Atomics are a way to mark thread shared variables in a sense. Many instances of variables in high level code are a delusion. Assignment to them is not supposed to be a real observable effect. Threads have a “debugger” view into your code.

Maybe a better starting point (or even equally valid as assuming literality of code) is every statement can be reordered. Ok then this isn’t quite true. There are certain restrictions.

a = 2; x = 3; a = 3; These assignments aren’t real. Loop reordering

Data Race Freedom

Memory Barriers: a Hardware View for Software Hackers Memory access ordering an introduction string vs weak memory models - programming languages need to be as weak as the weakest hardware they target, or else they can’t generate optimal code. Interesting.

C assignment is not imperative assignment really. The resulting code is not required to have an instruction that corresponds to it. It is more like a functional binding. So then relying on it for some concurrent construct (and concurrency is stateful/imperative?) is nonsensical

Caches. One model of the problem is to imagine your program runs literally, but then randomly flushes a store cache out to main memory.

Binary Translators for Weak Memory Model Architectures

Data Race Freedom

People

Concurrency Mutex Lock Concurrent hash table https://en.wikipedia.org/wiki/Concurrent_hash_table barriers

  • https://www.youtube.com/watch?v=80ifzK3b8QQ&list=PL1835A90FC78FF8BE&index=1&ab_channel=BartoszMilewski - bartosz on c++11 concurrency

pthread vs std::thread Arc in rust condition variables

  • https://www.manning.com/books/c-plus-plus-concurrency-in-action-second-edition

  • https://www.coursera.org/learn/concurrent-programming-in-java

C++/Java/Ocaml memory models

  • https://github.com/herd/CoqCat - interesting thesis too
  • https://github.com/herd/herdtools7
  • https://github.com/ocamllabs/ocaml-memory-model
  • https://en.wikipedia.org/wiki/Memory_model_(programming)
  • https://www.hboehm.info/c++mm/ hans boehm link farm - “impossible as a library”
  • http://www.cl.cam.ac.uk/~pes20/cpp/popl085ap-sewell.pdf - mathematizzing C++ memory model, kodkod, isabelle

Various kinds of relations.

futex - fast userspace mutex

MPI Cilk OpenMP

rust model checker actor system thing https://www.stateright.rs/

Leslie lamport

  • Lindsey kuper has a dist sys course It’s probably pretty good https://www.youtube.com/playlist?list=PLNPUF5QyWU8O0Wd8QDh9KaM1ggsxspJ31

Lecture 1. several node,s connected by network, characterized by partial failure cloud vs hpc cloud treat failure as expected and work around it “Eveerything fails all tyhe timne” - werner vogels hpc- partial is total failure. checkpointing

M1 -> M2 x? Posslbe problems:

  1. lost request
  2. request from M1 is slow
  3. M2 crashes
  4. M2 slow to respond
  5. seponse is slow
  6. response could get lost
  7. corrupt - byzantine faults.

important: send request to another node and don’t receive a response, it is impossible for you to know why

How do real systems cope? Timeout - wait a certain amount of time then assume failure - imperfect solutuojn

max delay d. max process time r. cn’t really do this asynchronous network - no max Palvaro - partiasl failure + unbounded latency

  • https://people.ucsc.edu/~palvaro/ - dedalus - datalog in time and space. https://www.youtube.com/watch?v=R2Aa4PivG0g&ab_channel=StrangeLoopConference bloom blazes, data lineage multiple proofs are fault tolerant. lineage. Why did happen: produce explanation / proof. Cegis? concolic kind of? observability.

  • https://martin.kleppmann.com/ https://www.youtube.com/playlist?list=PLeKd45zvjcDFUEv_ohr_HdUFe97RItdiB

  • https://github.com/aphyr/distsys-class
  • https://www.distributedsystemscourse.com/

  • kyle kingsbury

  • chaos engineering - nora jones netflix

  • http://book.mixu.net/distsys/
  • https://www.youtube.com/watch?v=cQP8WApzIQQ&list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB&ab_channel=MIT6.824%3ADistributedSystems

Lecture 2 - kuper

clocks. scheduling. intervals in time time of day clocks, monotonic clocks NTP network time protocl. not so good for measuring inervals. can jump forward or backward cloudflare leap second bug monotonic - only goes forward

import time time.monotonic() . not comparable between machines

logical clocks only measuring order of events. A can cause B, not vie versa lamport diagrams/ spacetime daigrams

network models synhcrnous network is one where exists an n such no message takes longer than n asynchrnous - one where the is no n such that no message takes longer than n units “partially synchrnous” n exists but you don’t know what it is

happens before - on same process. if a is send,, and b is receive, then a is before c, trasntivity concurrent - not ordered

state -

mate soos talking about adding mpi z3 had mpi but was disabled. armin biere talking about pthreads - pligenling uses pthreads https://twitter.com/SoosMate/status/1375935222907305988?s=20 https://www.researchgate.net/publication/221403132_A_Concurrent_Portfolio_Approach_to_SMT_Solving