5.1 Memory Ordering for Atomic Operations

Published: 2024-12-29 Updated: 2025-03-07 Category: cpp-concurrency-in-action


C++ Memory Model Concepts

The book focuses on two relationships: synchronizes-with and happens-before.

Synchronizes-with

The synchronizes-with relationship is something that you can get only between operations on atomic types.

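The book also notes that some operations use atomic operations internally (locking a mutex, for example) and therefore provide synchronization too, but the synchronizes-with relationship ultimately comes from atomic operations.

The synchronizes-with relationship provides a happens-before relationship across threads.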

Happens-before

Sequenced-before (single thread)

Within a single thread, an operation that is sequenced earlier happens before one that is sequenced later (for example, one statement and the statement after it), and it also strongly happens before it.

For a single thread, it’s largely straightforward: if one operation is sequenced before another, then it also happens before it, and strongly-happens-before it.

Tip

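My personal impression is that sequenced before in the book has a single-threaded flavor: it says that operation A comes before operation B within one thread, as if to say "this is simply how the code is written".

If two operations occur within the same statement, their relative order generally cannot be determined. There are exceptions:

In a comma expression, the left operand is evaluated before the right operand.
If the result of one expression is used as an argument of another, the first expression is evaluated before the second.

Note that a function argument list is not a comma expression. Even when the order in which arguments are passed is fixed, the order in which each argument is evaluated is unspecified. Because get_num() in the code below has a side effect, the program may print either "1, 2" or "2, 1".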

#include <iostream>
void foo(int a, int b)
{
    std::cout << a << ", " << b << std::endl;
}
int get_num()
{
    static int i = 0;
    return ++i;
}
int main()
{
    foo(get_num(), get_num()); // which of the two calls runs first is unspecified
}

By analogy, if + in A + B is viewed as operator+(A, B), the evaluation order of A and B is also unspecified.

Inter-thread happens-before (multiple threads)

At the basic level, inter-thread happens-before is relatively simple and relies on the synchronizes-with relationship introduced in section 5.3.1: if operation A in one thread synchronizes with operation B in another thread, then A inter-thread happens before B.

If operation A synchronizes with operation B in another thread, then A happens before B. This is called inter-thread happens before.

Relationships between the terms

The same two rules described above apply: if operation A synchronizes-with operation B, or operation A is sequenced-before operation B, then A strongly-happens-before B.

graph TD
    syn["Synchronizes with
    (multithreading)"] ==> ihb["Inter-thread happens before"]
    syn ==> shb
    seq["Sequenced before
    (single-thread)"] ==> shb[Strongly happens before]
    shb --> hb["Happens before"]
    ihb --> hb

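Happens-before, strongly-happens-before, and inter-thread-happens-before are each transitive on their own. When applying transitivity, do not assume that any one of them can be chained with any other.

One memory order is special: memory_order_consume participates only in inter-thread happens before, and it is rarely used. The book says the C++ standard also recommends against using it!

Also: the description of memory_order on cppreference is more formal than the book's and makes the orderings easier to compare side by side.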

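C++ memory ordering models

Sequentially consistent model

If every atomic operation uses sequentially consistent ordering, all threads observe those operations in the same order, although which order that turns out to be may not be known before the program runs.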

The semantics of memory_order_seq_cst require a single total ordering over all operations tagged memory_order_seq_cst.
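To make the single total order concrete, here is a minimal sketch modeled on the classic two-writers, two-readers test. The function name run_seq_cst_demo and the variables x, y, z are illustrative names, not taken from the original text. Because all operations are tagged memory_order_seq_cst, both readers must agree on which store came first, so at least one of them sees both flags set.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical demo: returns how many readers observed both flags set.
int run_seq_cst_demo() {
    std::atomic<bool> x{false}, y{false};
    std::atomic<int> z{0};

    std::thread write_x([&] { x.store(true, std::memory_order_seq_cst); });
    std::thread write_y([&] { y.store(true, std::memory_order_seq_cst); });
    std::thread read_x_then_y([&] {
        while (!x.load(std::memory_order_seq_cst)) {}  // wait for x
        if (y.load(std::memory_order_seq_cst)) ++z;
    });
    std::thread read_y_then_x([&] {
        while (!y.load(std::memory_order_seq_cst)) {}  // wait for y
        if (x.load(std::memory_order_seq_cst)) ++z;
    });

    write_x.join();
    write_y.join();
    read_x_then_y.join();
    read_y_then_x.join();

    int seen = z.load();
    // With a single total order, the two readers cannot both miss the other flag.
    assert(seen != 0);
    return seen;
}
```

If the loads and stores were relaxed instead, the two readers would be allowed to disagree about the order of the stores, and a result of 0 would be permitted.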

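Under the two other classes of models, you cannot assume that different threads see multiple atomic operations in the same order (there is no single global order).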

In the absence of other ordering constraints, the only requirement is that all threads agree on the modification order of each individual variable.

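Relaxed model

Relaxed ordering introduces no synchronizes-with relationship by itself, but it still obeys the other happens-before rules; for example, the order of statements within one thread is respected. Subject to that, relaxed operations on different variables can be reordered freely.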
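A small sketch of where relaxed ordering is sufficient: a shared counter whose final value is read only after every worker has been joined. The names count_with_relaxed and relaxed_counter_example are hypothetical. The increments stay atomic, so none are lost, and the final read is ordered by join() and not by the relaxed operations themselves.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: each worker bumps the counter with relaxed ordering.
int count_with_relaxed(int num_threads, int increments_per_thread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // Atomic, but imposes no ordering on other memory accesses.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();  // join() provides the happens-before edge
    return counter.load(std::memory_order_relaxed);
}

void relaxed_counter_example() {
    assert(count_with_relaxed(4, 1000) == 4000);
}
```

Relaxed ordering would not be enough if the counter were used as a signal that some other, non-atomic data is ready.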

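Release-acquire model

Under release-acquire ordering, atomic loads are acquire operations, atomic stores are release operations, and atomic RMW operations are acquire, release, or both. A release operation and an acquire operation on the same variable can form a synchronizes-with relationship: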

A release operation synchronizes-with an acquire operation that reads the value written.

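Sequentially consistent ordering is stronger than acquire-release ordering, so when the two are mixed, a sequentially consistent load behaves like an acquire load and a sequentially consistent store behaves like a release store.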
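The release-acquire pairing can be sketched as a simple message-passing pattern. The names run_release_acquire_demo, payload, and ready are illustrative. The release store to ready synchronizes with the acquire load that reads true, so the earlier plain write to payload is visible to the consumer.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical demo: publishes a plain int through an atomic flag.
int run_release_acquire_demo() {
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};
    int observed = -1;

    std::thread producer([&] {
        payload = 42;                                  // sequenced before the store
        ready.store(true, std::memory_order_release);  // release operation
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {}  // acquire operation
        observed = payload;  // the write of 42 happens before this read
    });

    producer.join();
    consumer.join();
    assert(observed == 42);
    return observed;
}
```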

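Release-consume model (e.g. RCU)

The book groups consume under the release-acquire model as well, but it is a more complicated ordering. It describes data dependencies and introduces two concepts: dependency-ordered-before and carries-a-dependency-to.

Carries-a-dependency-to is a dependency relationship inside a single thread, comparable to sequenced-before. If the result of operation A is used as an operand of operation B, then A carries a dependency to B; this is also a sequenced-before relationship. The exceptions: std::kill_dependency and the operators &&, ||, ?:, and , do not propagate a dependency.

Dependency-ordered-before is the relationship established between a release / acq_rel / seq_cst write and a consume read of the same variable in another thread. If operation A is dependency-ordered before operation B, then A inter-thread happens before B.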

https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Consume_ordering
If an atomic store in thread A is tagged memory_order_release, an atomic load in thread B from the same variable is tagged memory_order_consume, and the load in thread B reads a value written by the store in thread A, then the store in thread A is dependency-ordered before the load in thread B.

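Once a dependency-ordered-before relationship is established, only the values that carry the dependency are guaranteed to be visible, so its scope is narrower than that of acquire. Consume semantics can be used to guarantee that a published object is seen in full. The book's example: allocate an object on the heap (new), initialize it, and store the pointer in a std::atomic<T*>; if another thread loads the pointer with consume ordering, that thread also sees the contents of the memory it points to. Cppreference notes that on most platforms this ordering affects only compiler optimizations.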
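The pointer-publication pattern described above might look like the following sketch. The Config struct and the function run_consume_demo are hypothetical names chosen for illustration. Note that most compilers treat memory_order_consume as memory_order_acquire, and newer language standards discourage it, so this is only meant to show the shape of the pattern.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <thread>

// Hypothetical payload type published through a pointer.
struct Config {
    int version;
    std::string name;
};

int run_consume_demo() {
    std::atomic<Config*> published{nullptr};
    int seen_version = 0;

    std::thread publisher([&] {
        Config* c = new Config{7, "demo"};              // fully initialize first
        published.store(c, std::memory_order_release);  // then publish the pointer
    });
    std::thread reader([&] {
        Config* c = nullptr;
        while (!(c = published.load(std::memory_order_consume))) {}
        // Accesses through c carry a dependency from the load, so the
        // initialized fields are visible here.
        seen_version = c->version;
        assert(c->name == "demo");
    });

    publisher.join();
    reader.join();
    delete published.load();
    return seen_version;
}
```

Data that the publisher wrote but that is not reached through the loaded pointer is not covered by this guarantee; an acquire load would be needed for that.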

https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Consume_ordering
Typical use cases for this ordering involve read access to rarely written concurrent data structures (routing tables, configuration, security policies, firewall rules, etc) and publisher-subscriber situations with pointer-mediated publication, that is, when the producer publishes a pointer through which the consumer can access information: there is no need to make everything else the producer wrote to memory visible to the consumer (which may be an expensive operation on weakly-ordered architectures). An example of such scenario is rcu_dereference（RCU 指 read-copy-update）.

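Data loaded with consume ordering carries a dependency. std::kill_dependency removes that dependency; apart from being a hint to the compiler, it simply returns a copy of its argument. In addition, [[carries_dependency]] is an attribute added in C++11 that indicates dependency propagation through a function's parameters or return value. With this hint the compiler can avoid emitting memory fences around the function call, which reduces overhead.

Most compilers do not track dependency chains, and in that case consume is implemented as acquire. Since C++17 the specification of memory_order_consume has been under revision, and its use is temporarily discouraged.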

Volatile

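volatile in C++ cannot be used for synchronization:

It does not guarantee that reads and writes are atomic.
Accesses through volatile glvalues cannot be reordered with respect to other visible side effects, but this holds only within the same thread, not across threads.

Under the default settings of Microsoft Visual Studio, a volatile read has acquire semantics and a volatile write has release semantics, which resembles the JVM. It is best not to rely on this behavior because it hurts portability.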

Release sequence

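A release operation can synchronize with an acquire operation even when RMW operations from any thread, with any memory order, are interleaved between them. In other words, the acquire operation may read the value written by the release operation itself or the value written by any RMW operation in the sequence that follows it.

The book's example: one producer publishes with release ordering, and two consumers use fetch_sub with acquire ordering to obtain indices into the array of produced items. Atomicity of fetch_sub already ensures that the two consumers never get the same index. The two consumers do not synchronize with each other, but because their fetch_sub operations belong to the release sequence, the consumer that reads the other's decremented value still synchronizes with the producer's store and sees the produced data.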
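A rough sketch of that producer and two-consumer setup, simplified so that it terminates. The names run_release_sequence_demo, items, count, and sums are assumptions for illustration and do not reproduce the book's listing. Each consumer adds up the items it claimed; the combined total is correct only if every index is claimed exactly once and every claimed item is visible.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

long long run_release_sequence_demo() {
    constexpr int kItems = 1000;
    std::vector<int> items(kItems, 0);
    std::atomic<int> count{0};
    long long sums[2] = {0, 0};  // one slot per consumer, no sharing

    auto consume = [&](int id) {
        // Wait until something has been published (or already drained).
        while (count.load(std::memory_order_relaxed) == 0) {}
        for (;;) {
            // RMW with acquire: part of the release sequence headed by the
            // producer's store, so it synchronizes with that store.
            int remaining = count.fetch_sub(1, std::memory_order_acquire);
            if (remaining <= 0) break;           // nothing left to claim
            sums[id] += items[remaining - 1];    // index claimed exclusively
        }
    };

    std::thread c0(consume, 0);
    std::thread c1(consume, 1);
    std::thread producer([&] {
        for (int i = 0; i < kItems; ++i) items[i] = i + 1;
        count.store(kItems, std::memory_order_release);  // head of the sequence
    });

    producer.join();
    c0.join();
    c1.join();
    return sums[0] + sums[1];  // 1 + 2 + ... + kItems
}
```

The consumer that drains the last item always performs one more fetch_sub, which pushes count below zero and releases a consumer that is still in the initial wait loop.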

Fences

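Also known as memory barriers. A fence is its own synchronization point, and it typically takes a pair of fences to complete the synchronization, such as operations A and B below (shown as a code block because the function names are long):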

std::atomic_thread_fence(std::memory_order_release); // A

// in another thread
std::atomic_thread_fence(std::memory_order_acquire); // B

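A fence imposes stronger ordering than an operation on an atomic variable. For example, a release fence prevents reads and writes before the fence from being reordered with writes after the fence. See also Java Memory Order.

On x86, fences other than the sequentially consistent one emit no extra synchronization instructions; they only restrict instruction reordering by the compiler. (From cppreference.)

See also Acquire and Release Fences Don’t Work the Way You’d Expect, which points out that a release operation on an atomic variable and a release fence have different semantics.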

A release operation (such as the one on the left) only needs to prevent preceding memory operations from being reordered past itself, but a release fence (such as the one on the right) must prevent preceding memory operations from being reordered past all subsequent writes.

https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence lists three kinds of synchronization:

(mo = memory order)

1. fence-atomic synchronization
                         
fence(mo_release)        |
store(mo_any)            |
                         |    load(mo_acquire)

2. atomic-fence synchronization

store(mo_release)        |
                         |    load(mo_any)
                         |    fence(mo_acquire)

3. fence-fence synchronization

fence(mo_release)        |
store(mo_any)            |
                         |    load(mo_any)
                         |    fence(mo_acquire)

Fences are fairly easy to get wrong; see the third code snippet in https://stackoverflow.com/a/43429224/. Besides placing the fence correctly, you also need an atomic variable to achieve synchronization.
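As a sketch of fence-fence synchronization (the third case listed above), the following uses two fences with a relaxed atomic flag between them. The names run_fence_demo, payload, and flag are illustrative. The fences alone would not synchronize anything; the relaxed store and the relaxed load that reads it are what connect fence A to fence B.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int run_fence_demo() {
    int payload = 0;                 // plain data to be published
    std::atomic<bool> flag{false};
    int observed = -1;

    std::thread writer([&] {
        payload = 99;
        std::atomic_thread_fence(std::memory_order_release);  // A
        flag.store(true, std::memory_order_relaxed);  // store after the fence
    });
    std::thread reader([&] {
        while (!flag.load(std::memory_order_relaxed)) {}  // load before the fence
        std::atomic_thread_fence(std::memory_order_acquire);  // B
        observed = payload;  // A synchronizes with B, so 99 is visible
    });

    writer.join();
    reader.join();
    assert(observed == 99);
    return observed;
}
```

Placing the release fence after the store, or the acquire fence before the load, would break the pattern, which is one way the fence position can go wrong.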

Strongly ordered and weakly ordered architectures

References:

Classification:

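Strongly ordered architectures
- Sequential Consistency (SC): LoadLoad + LoadStore + StoreLoad + StoreStore
- Total Store Order (TSO): LoadLoad + LoadStore + StoreStore
  - An older store can be reordered with a newer load from a different address, so StoreLoad is dropped
  - Example: x86
- Partial Store Order (PSO): on top of TSO, stores from the same core to different addresses can also be reordered

Weakly ordered architectures
- Ordinary load and store instructions have no ordering requirements; any required ordering needs explicit synchronization instructions
- Example: Arm (incidentally, modern Arm supports both big-endian and little-endian; the choice must be made before the operating system boots, and little-endian is the usual one)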

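The SO answer mentions that weakly ordered architectures allow more parallelism and simplify hardware design, at the cost of shifting part of the burden to software (the compiler). Strongly ordered architectures guarantee LoadLoad ordering, so they may issue loads speculatively and flush the pipeline when the speculation turns out wrong, which makes the hardware more complex.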

Note

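Why does TSO drop StoreLoad, which looks like the most important guarantee? We often use acquire-release to synchronize, but on reflection, when threads do not need to synchronize with each other this ordering guarantee can be dropped. The other kinds of ordering guarantees are ones we never think about, only because we rarely synchronize explicitly and so have no feel for them; that does not make them unimportant.

Online material indicates that StoreLoad is a guarantee with a heavy performance cost, because it directly limits how asynchronous stores can be. On modern processors a store is usually first placed in a store buffer and written to memory asynchronously later. TSO assumes that the programmer (or compiler) will force StoreLoad ordering with an explicit memory barrier (such as mfence or the lock prefix on x86) instead of making the hardware pay this cost by default.

The content above draws on output from Grok3.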
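In portable C++, the explicit StoreLoad barrier mentioned in the note can be requested with a sequentially consistent fence. The sketch below is the usual store-buffering test; the name store_buffering_round is hypothetical. Each thread stores to one variable and then loads the other. With the seq_cst fences in place, at least one thread must observe the other's store; without them, both loads would be allowed to return 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: returns true if at least one thread saw the other's store.
bool store_buffering_round() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // StoreLoad barrier
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // StoreLoad barrier
        r2 = x.load(std::memory_order_relaxed);
    });

    t1.join();
    t2.join();
    // The two fences are totally ordered, so r1 == 0 && r2 == 0 is forbidden.
    assert(r1 == 1 || r2 == 1);
    return r1 == 1 || r2 == 1;
}
```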

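Synchronization relationships of standard library utilities

The book also lists synchronization relationships for many utility classes.

The construction of a std::thread synchronizes with the start of its thread function, and the completion of a thread synchronizes with the return of join() in another thread.
unlock() on a lock synchronizes with a subsequent successful lock() in another thread. latch and barrier are similar.
async, promise, and packaged_task have corresponding synchronization relationships.
A condition variable provides no synchronization relationship by itself; the synchronization comes from the mutex.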
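Two of these library-level relationships can be seen in a short sketch that uses no explicit atomics. The function name run_library_sync_demo and its variables are illustrative. The first part relies on thread construction and join(); the second relies on unlock() synchronizing with the next successful lock().

```cpp
#include <cassert>
#include <mutex>
#include <thread>

int run_library_sync_demo() {
    // Part 1: thread start and join.
    int input = 5;   // written before the thread is constructed
    int result = 0;
    std::thread worker([&] { result = input * 2; });  // sees input == 5
    worker.join();                                    // makes result visible here
    assert(result == 10);

    // Part 2: mutex unlock/lock.
    std::mutex m;
    int shared = 0;  // plain int, protected only by the mutex
    auto add = [&] {
        for (int i = 0; i < 1000; ++i) {
            std::lock_guard<std::mutex> guard(m);
            ++shared;  // each increment sees the previous holder's write
        }
    };
    std::thread a(add);
    std::thread b(add);
    a.join();
    b.join();
    assert(shared == 2000);

    return result + shared;
}
```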