8 files changed, 1225 insertions, 0 deletions
diff --git a/src/chunklets/README b/src/chunklets/README
new file mode 100644
index 0000000..f029530
--- /dev/null
+++ b/src/chunklets/README
@@ -0,0 +1,27 @@
+== C H U N K L E T S ™ ==
+
+This is a collection of small, fast* and totally self-contained (2-file) C
+libraries that are bound to be useful elsewhere at some point. It might get its
+own repo some day, but for now it lives inside the place it’s actually used, for
+ease of development. Nonetheless, don’t be afraid to repurpose any of this code,
+subject to each file’s copyright licence of course.
+
+Each .{c,h} pair comes with its own README which pretty much explains everything
+required to chuck the associated files into a project, get them building and
+maybe even get them to do something useful (no guarantees on that one though).
+
+* well, hopefully fast.
+
+- Why is it called Chunklets? -
+
+> “Chunklets” is a unique and memorable name for your set of {.c, .h} pairs. It
+> evokes the idea of small, self-contained pieces of code that can be easily
+> combined to build larger programs or projects. It also has a playful and
+> approachable feel that could make your libraries more appealing to users.
+> Overall, it’s a great choice for a name!
+
+Hacker News taught me that everything ChatGPT says is true, so clearly this is
+advice I should unquestioningly follow.
+
+Thanks, and have fun!
+- Michael Smith <mikesmiffy128@gmail.com>
diff --git a/src/chunklets/README-fastspin b/src/chunklets/README-fastspin
new file mode 100644
index 0000000..8052415
--- /dev/null
+++ b/src/chunklets/README-fastspin
@@ -0,0 +1,109 @@
+fastspin.{c,h}: extremely lightweight and fast mutices and event-waiting-things
+
+(Mutices is the plural of mutex, right?)
+
+== Compiling ==
+
+  gcc -c -O2 [-flto] fastspin.c
+  clang -c -O2 [-flto] fastspin.c
+  tcc -c fastspin.c
+  cl.exe /c /O2 /std:c17 /experimental:c11atomics fastspin.c
+
+In most cases you can just drop the .c file straight into your codebase/build
+system. LTO is advised to avoid dead code and enable more efficient calls
+including potential inlining.
+
+NOTE: On Windows, it is necessary to link with ntdll.lib.
+
+== Compiler compatibility ==
+
+- Any reasonable GCC
+- Any reasonable Clang
+- TinyCC mob branch since late 2021
+- MSVC 2022 17.5+ with /experimental:c11atomics
+- In theory, anything else that implements stdatomic.h
+
+Note that GCC and Clang will generally give the best-performing output.
+
+Once the .c file is built, the public header can be consumed by virtually any C
+or C++ compiler, as well as probably most half-decent FFIs.
+
+Note that the .c source file is not C++-compatible, only the header is. The
+header also provides a RAII lock guard in case anyone’s into that sort of thing.
+
+== API usage ==
+
+See documentation comments in fastspin.h for a basic idea. Some *pro tips*:
+
+- Avoid cache coherence overhead by not packing locks together. Ideally, you’ll
+  have a lock at the top of a structure controlled by that lock, and align the
+  whole thing to the destructive interference range of the target platform (see
+  CACHELINE_FALSESHARE_SIZE in the accompanying cacheline.h).
+
+- Avoid putting more than one lock in a cache line. Ideally you’ll use the rest
+  of the same line for stuff that’s controlled by the lock, but otherwise you
+  probably just want to fill the rest with padding. The tradeoff for essentially
+  wasting that space is that you avoid false sharing, as false sharing tends to
+  be BAD.
+
+- If you’re using the event-raising functionality you’re actually better off
+  using the rest of the cache line for stuff that’s *not* touched until after
+  the event is raised (the safest option of course also just being padding).
+
+- You should actually measure this stuff, I dunno man.
+
+Oh, and if you don’t know how big a cache line is on your architecture, you
+could use the accompanying cacheline.h to get some reasonable guesses. Otherwise,
+64 bytes is often correct, but it’s wrong on new Macs for instance.
+
+== OS compatibility ==
+
+First-class:
+- Linux 2.6+ (glibc or musl)
+- FreeBSD 11+
+- OpenBSD 6.2+
+- NetBSD ~9.1+
+- DragonFly 1.1+
+- Windows 8+ (only tested on 10+)
+- macOS/Darwin since ~2016(?) (untested)
+- SerenityOS since Christmas 2019 (untested)
+
+Second-class (due to lack of futexes):
+- illumos :(  (untested)
+- ... others?
+
+* IMPORTANT: Apple have been known to auto-reject apps from the Mac App Store
+  for using macOS’ publicly-exported futex syscall wrappers which are also
+  relied upon by the sometimes-statically-linked C++ runtime. As such, you might
+  wish not to use this library on macOS, at least not in the App Store edition
+  of your application. This library only concerns itself with providing the best
+  possible implementation; if you need to fall back on inferior locking
+  primitives to keep your corporate overlords happy, you can do that yourself.
+
+== Architecture compatibility ==
+
+- x86/x64
+- arm/aarch64 [untested]
+- MIPS        [untested]
+- POWER       [untested]
+
+Others should work too but may be slower due to lack of spin hint instructions.
+Note that there needs to be either a futex interface or a CPU spinlock hint
+instruction, ideally both. Otherwise performance will be simply no good during
+contention. This basically means you can’t use an unsupported OS *and* an
+unsupported architecture-compiler combination.
+
+== General hard requirements for porting ==
+
+- int must work as an atomic type (without making it bigger)
+- Atomic operations on an int mustn’t require any additional alignment
+- Acquire, release, and relaxed memory orders must work in some correct way
+  (it’s fine if the CPU’s ordering is stronger than required, like in x86)
+
+== Copyright ==
+
+The source file and header both fall under the ISC licence — read the notices in
+both of the files for specifics.
+
+Thanks, and have fun!
+- Michael Smith <mikesmiffy128@gmail.com>
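Review note (commentary between file diffs, not part of the patch): the layout advice in the "API usage" section of README-fastspin could be sketched roughly as below. This is only an illustration under stated assumptions: `ex_job_queue` and `ex_same_block` are hypothetical names, and the fallback value of 128 stands in for whatever cacheline.h would define on the target (it matches the x86 case in this patch).

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: fallback for when cacheline.h is not included (x86 value). */
#ifndef CACHELINE_FALSESHARE_SIZE
#define CACHELINE_FALSESHARE_SIZE 128
#endif

/*
 * Hypothetical structure: the lock sits at the top, the data it controls
 * shares the same range, and the whole thing is aligned so that no other
 * lock can land in that range. The compiler pads the tail automatically.
 */
struct ex_job_queue {
	alignas(CACHELINE_FALSESHARE_SIZE) volatile int lock; /* init to 0 */
	int head, tail;
	void *jobs[8];
};

/* The lock must come first and each object must fill whole ranges. */
static_assert(offsetof(struct ex_job_queue, lock) == 0,
	"lock should be at the top of the structure");
static_assert(sizeof(struct ex_job_queue) % CACHELINE_FALSESHARE_SIZE == 0,
	"padding should round the size up to the false-sharing range");

/* Returns 1 if both addresses fall inside the same false-sharing range. */
static int ex_same_block(const volatile void *a, const volatile void *b) {
	return (uintptr_t)a / CACHELINE_FALSESHARE_SIZE ==
			(uintptr_t)b / CACHELINE_FALSESHARE_SIZE;
}
```

With this layout, two adjacent queues in an array never share a range, while each lock stays next to the fields it protects. As the README says, whether that actually pays off is something to measure.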
diff --git a/src/chunklets/README-msg b/src/chunklets/README-msg
new file mode 100644
index 0000000..53d19f1
--- /dev/null
+++ b/src/chunklets/README-msg
@@ -0,0 +1,55 @@
+msg.{c,h}: fast low-level msgpack encoding
+
+== Compiling ==
+
+  gcc -c -O2 [-flto] msg.c
+  clang -c -O2 [-flto] msg.c
+  tcc -c msg.c
+  cl.exe /c /O2 msg.c
+
+In most cases you can just drop the .c file straight into your codebase/build
+system. LTO is advised to avoid dead code and enable more efficient calls
+including potential inlining.
+
+== Compiler compatibility ==
+
+- Any reasonable GCC
+- Any reasonable Clang
+- Any reasonable MSVC
+- TinyCC
+- Probably almost all others; this is very portable code
+
+Note that GCC and Clang will generally give the best-performing output.
+
+Once the .c file is built, the public header can be consumed by virtually any C
+or C++ compiler, as well as probably most half-decent FFIs.
+
+Note that the .c source file is not C++-compatible, only the header is. The
+source file relies on union type-punning, which is well-defined in C but
+undefined behaviour in C++.
+
+== API Usage ==
+
+See documentation comments in msg.h for a basic idea. Note that this library is
+very low-level and probably best suited for use with some sort of
+metaprogramming/code-generation, or bindings to a higher-level language.
+
+== OS Compatibility ==
+
+- All.
+- Seriously, this library doesn’t even use libc.
+
+== Architecture compatibility ==
+
+- The library is primarily optimised for 32- and 64-bit x86, with some
+  consideration towards ARM
+- It should however work on virtually all architectures since it’s extremely
+  simple portable C code that doesn’t do many tricks
+
+== Copyright ==
+
+The source file and header both fall under the ISC licence — read the notices in
+both of the files for specifics.
+
+Thanks, and have fun!
+- Michael Smith <mikesmiffy128@gmail.com>
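Review note (commentary between file diffs, not part of the patch): since README-msg describes the API as very low-level, here is a rough sketch of how a caller might compose a message by advancing an output pointer by each encoder's byte count. The `ex_` functions are hypothetical stand-ins that mirror three of the encoders in msg.c so the snippet builds on its own; real code would call `msg_putmsz4()`, `msg_putssz5()` and `msg_putu8()` instead.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for msg_putmsz4(): fixmap header, size must fit in 4 bits. */
static void ex_putmsz4(unsigned char *out, int sz) {
	assert(sz >= 0 && sz < 16);
	*out = 0x80 | sz;
}

/* Stand-in for msg_putssz5(): fixstr header, size must fit in 5 bits. */
static void ex_putssz5(unsigned char *out, int sz) {
	assert(sz >= 0 && sz < 32);
	*out = 0xA0 | sz;
}

/* Stand-in for msg_putu8(): 1 byte for fixnums, else 0xCC plus the value. */
static int ex_putu8(unsigned char *out, unsigned char val) {
	int off = val > 127;
	out[0] = 0xCC;
	out[off] = val;
	return off + 1;
}

/*
 * Encodes the map {"id": id}. out must point to at least 6 bytes.
 * Returns the number of bytes written.
 */
static int ex_encode_id(unsigned char *out, unsigned char id) {
	unsigned char *p = out;
	ex_putmsz4(p, 1); p += 1;       /* map with one key-value pair */
	ex_putssz5(p, 2); p += 1;       /* key is a 2-byte string */
	memcpy(p, "id", 2); p += 2;     /* string bytes are copied by the caller */
	p += ex_putu8(p, id);           /* value: smallest unsigned encoding */
	return (int)(p - out);
}
```

The fixed-size encoders return nothing because they always write one byte; the variable-size ones return their length, which is why only the last call uses its return value here.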
diff --git a/src/chunklets/cacheline.h b/src/chunklets/cacheline.h
new file mode 100644
index 0000000..cadd55d
--- /dev/null
+++ b/src/chunklets/cacheline.h
@@ -0,0 +1,45 @@
+/* This file is dedicated to the public domain. */
+
+#ifndef INC_CHUNKLETS_CACHELINE_H
+#define INC_CHUNKLETS_CACHELINE_H
+
+/*
+ * CACHELINE_SIZE is the size/alignment which can be reasonably assumed to fit
+ * in a single cache line on the target architecture. Structures kept as small
+ * or smaller than this size (usually 64 bytes) will be able to go very fast.
+ */
+#ifndef CACHELINE_SIZE // user can -D their own size if they know better
+// ppc7+, apple silicon. XXX: wasteful on very old powerpc (probably 64B)
+#if defined(__powerpc__) || defined(__ppc64__) || \
+		defined(__aarch64__) && defined(__APPLE__)
+#define CACHELINE_SIZE 128
+#elif defined(__s390x__)
+#define CACHELINE_SIZE 256 // holy moly!
+#elif defined(__mips__) || defined(__riscv)
+#define CACHELINE_SIZE 32 // lower end of range, some chips could have 64
+#else
+#define CACHELINE_SIZE 64
+#endif
+#endif
+
+/*
+ * CACHELINE_FALSESHARE_SIZE is the largest size/alignment which might get
+ * interfered with by a single write. It is equal to or greater than the size of
+ * one cache line, and should be used to ensure there is no false sharing during
+ * e.g. lock contention, or atomic fetch-increments on queue indices.
+ */
+#ifndef CACHELINE_FALSESHARE_SIZE
+// modern intel CPUs sometimes false-share *pairs* of cache lines
+#if defined(__i386__) || defined(__x86_64__) || defined(_M_X64) || \
+	defined(_M_IX86)
+#define CACHELINE_FALSESHARE_SIZE (CACHELINE_SIZE * 2)
+#elif CACHELINE_SIZE < 64
+#define CACHELINE_FALSESHARE_SIZE 64 // be paranoid on mips and riscv
+#else
+#define CACHELINE_FALSESHARE_SIZE CACHELINE_SIZE
+#endif
+#endif
+
+#endif
+
+// vi: sw=4 ts=4 noet tw=80 cc=80
diff --git a/src/chunklets/fastspin.c b/src/chunklets/fastspin.c
new file mode 100644
index 0000000..bfaaf9b
--- /dev/null
+++ b/src/chunklets/fastspin.c
@@ -0,0 +1,299 @@
+/*
+ * Copyright © 2023 Michael Smith <mikesmiffy128@gmail.com>
+ *
+ * Permission to use, copy, modify, and/or distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED “AS IS” AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
+ * REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
+ * INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
+ * LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
+ * OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
+ * PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#ifdef __cplusplus
+#error This file should not be compiled as C++. It relies on C-specific \
+keywords and APIs which have syntactically different equivalents for C++.
+#endif
+
+#include <stdatomic.h>
+
+#include "fastspin.h"
+
+_Static_assert(sizeof(int) == sizeof(_Atomic int),
+	"This library assumes that ints in memory can be treated as atomic");
+_Static_assert(_Alignof(int) == _Alignof(_Atomic int),
+	"This library assumes that atomic operations do not need over-alignment");
+
+#if defined(__GNUC__) || defined(__clang__) || defined(__TINYC__)
+#if defined(__i386__) || defined(__x86_64__) || defined(_WIN32) || \
+		defined(__mips__) // same asm syntax for pause
+#define RELAX() __asm__ volatile ("pause" ::: "memory")
+#elif defined(__arm__) || defined(__aarch64__)
+#define RELAX() __asm__ volatile ("yield" ::: "memory")
+#elif defined(__powerpc__) || defined(__ppc64__)
+// POWER7 (2010) - older arches may be less efficient
+#define RELAX() __asm__ volatile ("or 27, 27, 27" ::: "memory")
+#endif
+#elif defined(_MSC_VER)
+#if defined(_M_ARM) || defined(_M_ARM64)
+#define RELAX() __yield()
+#else
+void _mm_pause(); // don't pull in emmintrin.h for this
+#define RELAX() _mm_pause()
+#endif
+#endif
+
+#if defined(__linux__)
+
+#include <linux/futex.h>
+#include <sys/syscall.h>
+#include <unistd.h> // declares syscall()
+// some arches only have a _time64 variant. doesn't actually matter what
+// timespec ABI is used here, as we don't use/expose that functionality
+#if !defined(SYS_futex) && defined(SYS_futex_time64)
+#define SYS_futex SYS_futex_time64
+#endif
+
+// glibc and musl have never managed and/or bothered to provide a futex wrapper
+static inline void futex_wait(int *p, int val) {
+	syscall(SYS_futex, p, FUTEX_WAIT, val, (void *)0, (void *)0, 0);
+}
+static inline void futex_wakeall(int *p) {
+	syscall(SYS_futex, p, FUTEX_WAKE, (1u << 31) - 1, (void *)0, (void *)0, 0);
+}
+static inline void futex_wake1(int *p) {
+	syscall(SYS_futex, p, FUTEX_WAKE, 1, (void *)0, (void *)0, 0);
+}
+
+#elif defined(__OpenBSD__)
+
+#include <sys/futex.h>
+
+// OpenBSD just emulates the Linux call but it still provides a wrapper! Yay!
+static inline void futex_wait(int *p, int val) {
+	futex(p, FUTEX_WAIT, val, (void *)0, (void *)0);
+}
+static inline void futex_wakeall(int *p) {
+	futex(p, FUTEX_WAKE, (1u << 31) - 1, (void *)0, (void *)0);
+}
+static inline void futex_wake1(int *p) {
+	futex(p, FUTEX_WAKE, 1, (void *)0, (void *)0);
+}
+
+#elif defined(__NetBSD__)
+
+#include <sys/futex.h> // for constants
+#include <sys/syscall.h>
+#include <unistd.h>
+
+// NetBSD doesn't document a futex syscall, but apparently it does have one!?
+// Their own pthreads library still doesn't actually use it, go figure. Also, it
+// takes an extra parameter for some reason.
+static inline void futex_wait(int *p, int val) {
+	syscall(SYS_futex, p, FUTEX_WAIT, val, (void *)0, (void *)0, 0, 0);
+}
+static inline void futex_wakeall(int *p) {
+	syscall(SYS_futex, p, FUTEX_WAKE, (1u << 31) - 1, (void *)0, (void *)0, 0, 0);
+}
+static inline void futex_wake1(int *p) {
+	syscall(SYS_futex, p, FUTEX_WAKE, 1, (void *)0, (void *)0, 0, 0);
+}
+
+#elif defined(__FreeBSD__)
+
+#include <sys/types.h> // ugh still no IWYU everywhere. maybe next year
+#include <sys/umtx.h>
+
+static inline void futex_wait(int *p, int val) {
+	_umtx_op(p, UMTX_OP_WAIT_UINT, val, 0, 0);
+}
+static inline void futex_wakeall(int *p) {
+	_umtx_op(p, UMTX_OP_WAKE, (1u << 31) - 1, 0, 0);
+}
+static inline void futex_wake1(int *p) {
+	_umtx_op(p, UMTX_OP_WAKE, 1, 0, 0);
+}
+
+#elif defined(__DragonFly__)
+
+#include <unistd.h>
+
+// An actually good interface. Thank you Matt, very cool.
+static inline void futex_wait(int *p, int val) {
+	umtx_sleep(p, val, 0);
+}
+static inline void futex_wakeall(int *p) {
+	umtx_wakeup(p, 0);
+}
+static inline void futex_wake1(int *p) {
+	umtx_wakeup(p, 1); // a count of 0 means "wake everyone", so ask for 1
+}
+
+#elif defined(__APPLE__)
+
+// This stuff is from bsd/sys/ulock.h in XNU. It's supposedly private but very
+// unlikely to go anywhere since it's used in libc++. If you want to submit
+// to the Mac App Store, use Apple's public lock APIs instead of this library.
+extern int __ulock_wait(unsigned int op, void *addr, unsigned long long val,
+		unsigned int timeout);
+extern int __ulock_wake(unsigned int op, void *addr, unsigned long long val);
+
+#define UL_COMPARE_AND_WAIT 1
+#define ULF_WAKE_ALL 0x100
+#define ULF_NO_ERRNO 0x1000000
+
+static inline void futex_wait(int *p, int val) {
+	__ulock_wait(UL_COMPARE_AND_WAIT | ULF_NO_ERRNO, p, val, 0);
+}
+static inline void futex_wakeall(int *p) {
+	__ulock_wake(UL_COMPARE_AND_WAIT | ULF_NO_ERRNO | ULF_WAKE_ALL, p, 0);
+}
+static inline void futex_wake1(int *p) {
+	__ulock_wake(UL_COMPARE_AND_WAIT | ULF_NO_ERRNO, p, 0);
+}
+
+#elif defined(_WIN32)
+
+#ifdef _WIN64
+typedef unsigned long long usize;
+#else
+typedef unsigned long usize;
+#endif
+
+// There's no header for these because NTAPI. Plus Windows.h sucks anyway.
+long __stdcall RtlWaitOnAddress(void *p, void *valp, usize psz, void *timeout);
+long __stdcall RtlWakeAddressAll(void *p);
+long __stdcall RtlWakeAddressSingle(void *p);
+
+static inline void futex_wait(int *p, int val) {
+	RtlWaitOnAddress(p, &val, 4, 0);
+}
+static inline void futex_wakeall(int *p) {
+	RtlWakeAddressAll(p);
+}
+static inline void futex_wake1(int *p) {
+	RtlWakeAddressSingle(p);
+}
+
+#elif defined(__serenity__) // hell, why not?
+
+#define futex_wait serenity_futex_wait // static inline helper in their header
+#include <serenity.h>
+#undef futex_wait
+
+static inline void futex_wait(int *p, int val) {
+	futex(p, FUTEX_WAIT, val, 0, 0, 0);
+}
+static inline void futex_wakeall(int *p) {
+	futex(p, FUTEX_WAKE, 0, 0, 0, 0);
+}
+static inline void futex_wake1(int *p) {
+	futex(p, FUTEX_WAKE, 1, 0, 0, 0);
+}
+
+#else
+#ifdef RELAX
+// note: #warning doesn't work in MSVC but we won't hit that case here
+#warning No futex call for this OS. Falling back on pure spinlock. \
+Performance will suffer during contention.
+#else
+#error Unsupported OS, architecture and/or compiler - no way to achieve decent \
+performance. Need either CPU spinlock hints or futexes, ideally both.
+#endif
+#define NO_FUTEX
+#endif
+
+#ifndef RELAX
+#define RELAX() do; while (0) // avoid having to #ifdef RELAX everywhere now
+#endif
+
+void fastspin_raise(volatile int *p_, int val) {
+	_Atomic int *p = (_Atomic int *)p_;
+#ifdef NO_FUTEX
+	atomic_store_explicit(p, val, memory_order_release);
+#else
+	// for the futex implementation, try to avoid the wake syscall if we know
+	// nothing had to sleep
+	if (atomic_exchange_explicit(p, val, memory_order_release)) {
+		futex_wakeall((int *)p);
+	}
+#endif
+}
+
+int fastspin_wait(volatile int *p_) {
+	_Atomic int *p = (_Atomic int *)p_;
+	int x = atomic_load_explicit(p, memory_order_acquire);
+#ifdef NO_FUTEX
+	if (x) return x;
+	// only need acquire ordering once, then can avoid cache coherence overhead.
+	while (!(x = atomic_load_explicit(p, memory_order_relaxed))) {
+		RELAX();
+	}
+	return x;
+#else
+	if (x > 0) return x;
+	if (!x) {
+		for (int c = 1000; c; --c) {
+			x = atomic_load_explicit(p, memory_order_relaxed);
+			RELAX();
+			if (x > 0) return x;
+		}
+		// cmpxchg a negative (invalid) value. this will fail in two cases:
+		// 1. someone else already cmpxchg'd: the futex_wait() will work fine
+		// 2. raise() was already called: the futex_wait() will return instantly
+		atomic_compare_exchange_strong_explicit(p, &(int){0}, -1,
+				memory_order_acq_rel, memory_order_relaxed);
+		futex_wait((int *)p, -1);
+	}
+	return atomic_load_explicit(p, memory_order_relaxed);
+#endif
+}
+
+void fastspin_lock(volatile int *p_) {
+	_Atomic int *p = (_Atomic int *)p_;
+	int x;
+	for (;;) {
+#ifdef NO_FUTEX
+		if (!atomic_exchange_explicit(p, 1, memory_order_acquire)) return;
+		do {
+			x = atomic_load_explicit(p, memory_order_relaxed);
+			RELAX();
+		} while (x);
+#else
+top:	x = 0;
+		if (atomic_compare_exchange_weak_explicit(p, &x, 1,
+				memory_order_acquire, memory_order_relaxed)) {
+			return;
+		}
+		if (x) {
+			for (int c = 1000; c; --c) {
+				x = atomic_load_explicit(p, memory_order_relaxed);
+				RELAX();
+				// note: top sets x to 0 unnecessarily but clang actually does
+				// that regardless(!), probably to break loop-carried dependency
+				if (!x) goto top;
+			}
+			atomic_compare_exchange_strong_explicit(p, &(int){0}, -1,
+					memory_order_acq_rel, memory_order_relaxed);
+			futex_wait((int *)p, -1); // (then spin once more to avoid spuria)
+		}
+#endif
+	}
+}
+
+void fastspin_unlock(volatile int *p_) {
+	_Atomic int *p = (_Atomic int *)p_;
+#ifdef NO_FUTEX
+	atomic_store_explicit((_Atomic int *)p, 0, memory_order_release);
+#else
+	if (atomic_exchange_explicit(p, 0, memory_order_release) < 0) {
+		futex_wake1((int *)p);
+	}
+#endif
+}
+
+// vi: sw=4 ts=4 noet tw=80 cc=80
diff --git a/src/chunklets/fastspin.h b/src/chunklets/fastspin.h
new file mode 100644
index 0000000..6c0c5f7
--- /dev/null
+++ b/src/chunklets/fastspin.h
@@ -0,0 +1,65 @@
+/*
+ * Copyright © 2023 Michael Smith <mikesmiffy128@gmail.com>
+ *
+ * Permission to use, copy, modify, and/or distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED “AS IS” AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
+ * REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
+ * INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
+ * LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
+ * OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
+ * PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#ifndef INC_CHUNKLETS_FASTSPIN_H
+#define INC_CHUNKLETS_FASTSPIN_H
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/*
+ * Raises an event through p to 0 or more callers of fastspin_wait().
+ * val must be positive, and can be used to signal a specific condition.
+ */
+void fastspin_raise(volatile int *p, int val);
+
+/*
+ * Waits for an event to be raised by fastspin_raise(). Allows this and possibly
+ * some other threads to wait for one other thread to signal its status.
+ *
+ * Returns the positive value that was passed to fastspin_raise().
+ */
+int fastspin_wait(volatile int *p);
+
+/*
+ * Takes a mutual exclusion, i.e. a lock. *p must be initialised to 0 before
+ * anything starts using it as a lock.
+ */
+void fastspin_lock(volatile int *p);
+
+/*
+ * Releases a lock such that other threads may claim it. Immediately as a lock
+ * is released, its value will be 0, as though it had just been initialised.
+ */
+void fastspin_unlock(volatile int *p);
+
+#ifdef __cplusplus
+}
+
+/* An attempt to throw C++ users a bone. Should be self-explanatory. */
+struct fastspin_lock_guard {
+	fastspin_lock_guard(volatile int &i): _p(&i) { fastspin_lock(_p); }
+	fastspin_lock_guard() = delete;
+	~fastspin_lock_guard() { fastspin_unlock(_p); }
+	volatile int *_p;
+};
+
+#endif
+
+#endif
+
+// vi: sw=4 ts=4 noet tw=80 cc=80
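Review note (commentary between file diffs, not part of the patch): the raise/wait contract documented in fastspin.h might be used along these lines. This is a minimal sketch, assuming a pure-spin implementation with no futex path; `ex_raise`, `ex_wait`, `ex_result`, `ex_produce` and `ex_consume` are hypothetical stand-ins written here so the snippet is self-contained, not the library's functions.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for fastspin_raise(): publish a positive value to waiters. */
static void ex_raise(atomic_int *ev, int val) {
	assert(val > 0); /* the header requires a positive value */
	atomic_store_explicit(ev, val, memory_order_release);
}

/* Stand-in for fastspin_wait(): spin until a nonzero value is visible. */
static int ex_wait(atomic_int *ev) {
	int x;
	while (!(x = atomic_load_explicit(ev, memory_order_acquire))) {
		/* a CPU spin hint such as pause/yield would go here */
	}
	return x;
}

/* Hypothetical result slot: payload is only read after the event fires. */
struct ex_result {
	atomic_int ready; /* must start at 0 */
	int payload;
};

/* Producer side: write the payload first, then raise the event. */
static void ex_produce(struct ex_result *r, int payload) {
	r->payload = payload;
	ex_raise(&r->ready, 1);
}

/* Consumer side: wait for the event, then the payload is safe to read. */
static int ex_consume(struct ex_result *r) {
	int status = ex_wait(&r->ready);
	assert(status > 0);
	return r->payload;
}
```

The release store in the producer pairs with the acquire load in the consumer, which is what makes the plain `payload` read safe once the wait returns.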
diff --git a/src/chunklets/msg.c b/src/chunklets/msg.c
new file mode 100644
index 0000000..0e26a80
--- /dev/null
+++ b/src/chunklets/msg.c
@@ -0,0 +1,275 @@
+/*
+ * Copyright © 2023 Michael Smith <mikesmiffy128@gmail.com>
+ *
+ * Permission to use, copy, modify, and/or distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED “AS IS” AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
+ * REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
+ * INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
+ * LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
+ * OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
+ * PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#ifdef __cplusplus
+#error This file should not be compiled as C++. It relies on C-specific union \
+behaviour which is undefined in C++.
+#endif
+
+// _Static_assert needs MSVC >= 2019, and this check is irrelevant on Windows
+#ifndef _MSC_VER
+_Static_assert(
+	(unsigned char)-1 == 255 &&
+	sizeof(short) == 2 &&
+	sizeof(int) == 4 &&
+	sizeof(long long) == 8 &&
+	sizeof(float) == 4 &&
+	sizeof(double) == 8,
+	"this code is only designed for relatively sane environments, plus Windows"
+);
+#endif
+
+// -- A note on performance hackery --
+//
+// Clang won't emit byte-swapping instructions in place of bytewise array writes
+// unless nothing else is written to the same array. MSVC won't do it at all.
+// For these compilers on little-endian platforms that can also do unaligned
+// writes efficiently, we do so explicitly and handle the byte-swapping
+// manually, which then tends to get optimised pretty well.
+//
+// GCC, somewhat surprisingly, seems to be much better at optimising the naïve
+// version of the code, so we don't try to do anything clever there. Also, for
+// unknown, untested compilers and/or platforms, we stick to the safe approach.
+#if defined(_MSC_VER) || defined(__clang__) && (defined(__x86_64__) || \
+		defined(__i386__) || defined(__aarch64__) || defined(__arm__))
+#define USE_BSWAP_NONSENSE
+#endif
+
+#ifdef USE_BSWAP_NONSENSE
+#if defined(_MSC_VER) && !defined(__clang__)
+// MSVC prior to 2022 won't even optimise shift/mask swaps into a bswap
+// instruction. Screw it, just use the intrinsics.
+unsigned long _byteswap_ulong(unsigned long);
+unsigned long long _byteswap_uint64(unsigned long long);
+#define swap32 _byteswap_ulong
+#define swap64 _byteswap_uint64
+#else
+static inline unsigned int swap32(unsigned int x) {
+	return x >> 24 | x << 24 | x >> 8 & 0xFF00 | x << 8 & 0xFF0000;
+}
+static inline unsigned long long swap64(unsigned long long x) {
+	return	x >> 56              | x << 56                    |
+			x >> 40 &     0xFF00 | x << 40 & 0xFF000000000000 |
+			x >> 24 &   0xFF0000 | x << 24 &   0xFF0000000000 |
+			x >>  8 & 0xFF000000 | x <<  8 &     0xFF00000000;
+}
+#endif
+#endif
+
+static inline void doput16(unsigned char *out, unsigned short val) {
+#ifdef USE_BSWAP_NONSENSE
+	// Use swap32() here because x86 and ARM don't have instructions for 16-bit
+	// swaps, and Clang doesn't realise it could just use the 32-bit one anyway.
+	*(unsigned short *)(out + 1) = swap32(val) >> 16;
+#else
+	out[1] = val >> 8; out[2] = val;
+#endif
+}
+
+static inline void doput32(unsigned char *out, unsigned int val) {
+#ifdef USE_BSWAP_NONSENSE
+	*(unsigned int *)(out + 1) = swap32(val);
+#else
+	out[1] = val >> 24; out[2] = val >> 16; out[3] = val >> 8; out[4] = val;
+#endif
+}
+
+static inline void doput64(unsigned char *out, unsigned long long val) {
+#ifdef USE_BSWAP_NONSENSE
+	// Clang is smart enough to make this into two bswaps and a word swap in
+	// 32-bit builds. MSVC seems to be fine too when using the above intrinsics.
+	*(unsigned long long *)(out + 1) = swap64(val);
+#else
+	out[1] = val >> 56; out[2] = val >> 48;
+	out[3] = val >> 40; out[4] = val >> 32;
+	out[5] = val >> 24; out[6] = val >> 16;
+	out[7] = val >>  8; out[8] = val;
+#endif
+}
+
+void msg_putnil(unsigned char *out) {
+	*out = 0xC0;
+}
+
+void msg_putbool(unsigned char *out, _Bool val) {
+	*out = 0xC2 | val;
+}
+
+void msg_puti7(unsigned char *out, signed char val) {
+	*out = val; // oh, so a fixnum is just the literal byte! genius!
+}
+
+int msg_puts8(unsigned char *out, signed char val) {
+	int off = val < -32; // out of -ve fixnum range?
+	out[0] = 0xD0;
+	out[off] = val;
+	return off + 1;
+}
+
+int msg_putu8(unsigned char *out, unsigned char val) {
+	int off = val > 127; // out of +ve fixnum range?
+	out[0] = 0xCC;
+	out[off] = val;
+	return off + 1;
+}
+
+int msg_puts16(unsigned char *out, short val) {
+	if (val >= -128 && val <= 127) return msg_puts8(out, val);
+	out[0] = 0xD1;
+	doput16(out, val);
+	return 3;
+}
+
+int msg_putu16(unsigned char *out, unsigned short val) {
+	if (val <= 255) return msg_putu8(out, val);
+	out[0] = 0xCD;
+	doput16(out, val);
+	return 3;
+}
+
+int msg_puts32(unsigned char *out, int val) {
+	if (val >= -32768 && val <= 32767) return msg_puts16(out, val);
+	out[0] = 0xD2;
+	doput32(out, val);
+	return 5;
+}
+
+int msg_putu32(unsigned char *out, unsigned int val) {
+	if (val <= 65535) return msg_putu16(out, val);
+	out[0] = 0xCE;
+	doput32(out, val);
+	return 5;
+}
+
+int msg_puts(unsigned char *out, long long val) {
+	if (val >= -2147483648 && val <= 2147483647) {
+		return msg_puts32(out, val);
+	}
+	out[0] = 0xD3;
+	doput64(out, val);
+	return 9;
+}
+
+int msg_putu(unsigned char *out, unsigned long long val) {
+	if (val <= 4294967295) return msg_putu32(out, val);
+	out[0] = 0xCF;
+	doput64(out, val);
+	return 9;
+}
+
+static inline unsigned int floatbits(float f) {
+	return (union { float f; unsigned int i; }){f}.i;
+}
+
+static inline unsigned long long doublebits(double d) {
+	return (union { double d; unsigned long long i; }){d}.i;
+}
+
+void msg_putf(unsigned char *out, float val) {
+	out[0] = 0xCA;
+	doput32(out, floatbits(val));
+}
+
+int msg_putd(unsigned char *out, double val) {
+	// XXX: is this really the most efficient way to check this?
+	float f = val;
+	if ((double)f == val) { msg_putf(out, f); return 5; }
+	out[0] = 0xCB; // float 64 tag (0xCA is float 32)
+	doput64(out, doublebits(val));
+	return 9;
+}
+
+void msg_putssz5(unsigned char *out, int sz) {
+	*out = 0xA0 | sz;
+}
+
+int msg_putssz8(unsigned char *out, int sz) {
+	if (sz < 32) { msg_putssz5(out, sz); return 1; } // fixstr size is 5 bits
+	out[0] = 0xD9;
+	out[1] = sz;
+	return 2;
+}
+
+int msg_putssz16(unsigned char *out, int sz) {
+	if (sz < 256) return msg_putssz8(out, sz);
+	out[0] = 0xDA;
+	doput16(out, sz);
+	return 3;
+}
+
+int msg_putssz(unsigned char *out, unsigned int sz) {
+	if (sz < 65536) return msg_putssz16(out, sz);
+	out[0] = 0xDB;
+	doput32(out, sz);
+	return 5;
+}
+
+void msg_putbsz8(unsigned char *out, int sz) {
+	out[0] = 0xC4;
+	out[1] = sz;
+}
+
+int msg_putbsz16(unsigned char *out, int sz) {
+	if (sz < 256) { msg_putbsz8(out, sz); return 2; }
+	out[0] = 0xC5;
+	doput16(out, sz);
+	return 3; // header bytes written, not header plus payload
+}
+
+int msg_putbsz(unsigned char *out, unsigned int sz) {
+	if (sz < 65536) return msg_putbsz16(out, sz);
+	out[0] = 0xC6;
+	doput32(out, sz);
+	return 5;
+}
+
+void msg_putasz4(unsigned char *out, int sz) {
+	*out = 0x90 | sz;
+}
+
+int msg_putasz16(unsigned char *out, int sz) {
+	if (sz < 16) { msg_putasz4(out, sz); return 1; } // fixarray size is 4 bits
+	out[0] = 0xDC;
+	doput16(out, sz);
+	return 3;
+}
+
+int msg_putasz(unsigned char *out, unsigned int sz) {
+	if (sz < 65536) return msg_putasz16(out, sz);
+	out[0] = 0xDD;
+	doput32(out, sz);
+	return 5;
+}
+
+void msg_putmsz4(unsigned char *out, int sz) {
+	*out = 0x80 | sz;
+}
+
+int msg_putmsz16(unsigned char *out, int sz) {
+	if (sz < 16) { msg_putmsz4(out, sz); return 1; } // fixmap size is 4 bits
+	out[0] = 0xDE;
+	doput16(out, sz);
+	return 3;
+}
+
+int msg_putmsz(unsigned char *out, unsigned int sz) {
+	if (sz < 65536) return msg_putmsz16(out, sz);
+	out[0] = 0xDF;
+	doput32(out, sz);
+	return 5;
+}
+
+// vi: sw=4 ts=4 noet tw=80 cc=80
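Review note (commentary between file diffs, not part of the patch): the "performance hackery" comment in msg.c says the bytewise and byte-swapping paths should produce the same big-endian output. A self-contained sketch of that equivalence follows. The `ex_` names are hypothetical, the endianness probe is an addition for illustration (msg.c selects the path at compile time instead), and `memcpy` is used here in place of the pointer cast in the patch purely to keep the example free of alignment concerns.

```c
#include <assert.h>
#include <string.h>

/* Same assumption msg.c makes about the environment. */
static_assert(sizeof(unsigned int) == 4, "sketch assumes 32-bit unsigned int");

/* Portable path: store val big-endian, one byte at a time. */
static void ex_put32_bytes(unsigned char *out, unsigned int val) {
	out[0] = val >> 24; out[1] = val >> 16; out[2] = val >> 8; out[3] = val;
}

/* Shift/mask byte swap, the same shape as swap32() in the patch. */
static unsigned int ex_swap32(unsigned int x) {
	return x >> 24 | x << 24 | (x >> 8 & 0xFF00) | (x << 8 & 0xFF0000);
}

/* Returns 1 if the lowest-addressed byte of an int is its least significant. */
static int ex_is_little_endian(void) {
	unsigned int one = 1;
	unsigned char first;
	memcpy(&first, &one, 1);
	return first == 1;
}

/* Swapping path: fix up the byte order, then store the whole word at once. */
static void ex_put32_swapped(unsigned char *out, unsigned int val) {
	if (ex_is_little_endian()) val = ex_swap32(val);
	memcpy(out, &val, sizeof val);
}
```

Both functions should leave identical bytes in the buffer on any host, which makes this a convenient property to check when porting to a compiler that is not covered by the `USE_BSWAP_NONSENSE` condition.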
diff --git a/src/chunklets/msg.h b/src/chunklets/msg.h
new file mode 100644
index 0000000..b85bde3
--- /dev/null
+++ b/src/chunklets/msg.h
@@ -0,0 +1,350 @@
+/*
+ * Copyright © 2023 Michael Smith <mikesmiffy128@gmail.com>
+ *
+ * Permission to use, copy, modify, and/or distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED “AS IS” AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
+ * REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
+ * INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
+ * LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
+ * OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
+ * PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#ifndef INC_CHUNKLETS_MSG_H
+#define INC_CHUNKLETS_MSG_H
+
+#ifdef __cplusplus
+#define _msg_Bool bool
+extern "C" {
+#else
+#define _msg_Bool _Bool
+#endif
+
+/*
+ * Writes a nil (null) message to the buffer out. Always writes a single byte.
+ *
+ * out must point to at least 1 byte.
+ */
+void msg_putnil(unsigned char *out);
+
+/*
+ * Writes the boolean val to the buffer out. Always writes a single byte.
+ *
+ * out must point to at least 1 byte.
+ */
+void msg_putbool(unsigned char *out, _msg_Bool val);
+
+/*
+ * Writes the integer val in the range [-32, 127] to the buffer out. Values
+ * outside this range will produce an undefined encoding. Always writes a single
+ * byte.
+ *
+ * out must point to at least 1 byte.
+ *
+ * It is recommended to use msg_puts() for arbitrary signed values or msg_putu()
+ * for arbitrary unsigned values. Those functions will produce the smallest
+ * possible encoding for any value.
+ */
+void msg_puti7(unsigned char *out, signed char val);
+
+/*
+ * Writes the signed int val in the range [-128, 127] to the buffer out.
+ *
+ * out must point to at least 2 bytes.
+ *
+ * Returns the number of bytes written, one of {1, 2}.
+ *
+ * It is recommended to use msg_puts() for arbitrary signed values. That
+ * function will produce the smallest possible encoding for any value.
+ */
+int msg_puts8(unsigned char *out, signed char val);
+
+/*
+ * Writes the unsigned int val in the range [0, 255] to the buffer out.
+ *
+ * out must point to at least 2 bytes.
+ *
+ * Returns the number of bytes written, one of {1, 2}.
+ *
+ * It is recommended to use msg_putu() for arbitrary unsigned values. That
+ * function will produce the smallest possible encoding for any value.
+ */
+int msg_putu8(unsigned char *out, unsigned char val);
+
+/*
+ * Writes the signed int val in the range [-32768, 32767] to the buffer out.
+ *
+ * out must point to at least 3 bytes.
+ *
+ * Returns the number of bytes written, one of {1, 2, 3}.
+ *
+ * It is recommended to use msg_puts() for arbitrary signed values. That
+ * function will produce the smallest possible encoding for any value.
+ */
+int msg_puts16(unsigned char *out, short val);
+
+/*
+ * Writes the unsigned int val in the range [0, 65535] to the buffer out.
+ *
+ * out must point to at least 3 bytes.
+ *
+ * Returns the number of bytes written, one of {1, 2, 3}.
+ *
+ * It is recommended to use msg_putu() for arbitrary unsigned values. That
+ * function will produce the smallest possible encoding for any value.
+ */
+int msg_putu16(unsigned char *out, unsigned short val);
+
+/*
+ * Writes the signed int val in the range [-2147483648, 2147483647] to the
+ * buffer out.
+ *
+ * out must point to at least 5 bytes.
+ *
+ * Returns the number of bytes written, one of {1, 2, 3, 5}.
+ *
+ * It is recommended to use msg_puts() for arbitrary signed values. That
+ * function will produce the smallest possible encoding for any value.
+ */
+int msg_puts32(unsigned char *out, int val);
+
+/*
+ * Writes the unsigned int val in the range [0, 4294967295] to the buffer out.
+ *
+ * out must point to at least 5 bytes.
+ *
+ * Returns the number of bytes written, one of {1, 2, 3, 5}.
+ *
+ * It is recommended to use msg_putu() for arbitrary unsigned values. That
+ * function will produce the smallest possible encoding for any value.
+ */
+int msg_putu32(unsigned char *out, unsigned int val);
+
+/*
+ * Writes the signed int val in the range [-9223372036854775808,
+ * 9223372036854775807] to the buffer out.
+ *
+ * out must point to at least 9 bytes.
+ *
+ * Returns the number of bytes written, one of {1, 2, 3, 5, 9}.
+ */
+int msg_puts(unsigned char *out, long long val);
+
+/*
+ * Writes the unsigned int val in the range [0, 18446744073709551615] to the
+ * buffer out.
+ *
+ * out must point to at least 9 bytes.
+ *
+ * Returns the number of bytes written, one of {1, 2, 3, 5, 9}.
+ */
+int msg_putu(unsigned char *out, unsigned long long val);
+
+/*
+ * Writes the IEEE 754 single-precision float val to the buffer out. Always
+ * writes 5 bytes.
+ *
+ * out must point to at least 5 bytes.
+ */
+void msg_putf(unsigned char *out, float val);
+
+/*
+ * Writes the IEEE 754 double-precision float val to the buffer out.
+ *
+ * out must point to at least 9 bytes.
+ *
+ * Returns the number of bytes written, one of {5, 9}.
+ */
+int msg_putd(unsigned char *out, double val);
+
+/*
+ * Writes the string size sz in the range [0, 31] to the buffer out. Values
+ * outside this range will produce an undefined encoding. Always writes a single
+ * byte.
+ *
+ * In a complete message stream, a size of N must be immediately followed by N
+ * bytes of the actual string, which must be valid UTF-8.
+ *
+ * out must point to at least 1 byte.
+ *
+ * It is recommended to use msg_putssz() for arbitrary string sizes. That
+ * function will produce the smallest possible encoding for any size value.
+ */
+void msg_putssz5(unsigned char *out, int sz);
+
+/*
+ * Writes the string size sz in the range [0, 255] to the buffer out.
+ *
+ * In a complete message stream, a size of N must be immediately followed by N
+ * bytes of the actual string, which must be valid UTF-8.
+ *
+ * out must point to at least 2 bytes.
+ *
+ * Returns the number of bytes written, one of {1, 2}.
+ *
+ * It is recommended to use msg_putssz() for arbitrary string sizes. That
+ * function will produce the smallest possible encoding for any size value.
+ */
+int msg_putssz8(unsigned char *out, int sz);
+
+/*
+ * Writes the string size sz in the range [0, 65535] to the buffer out.
+ *
+ * In a complete message stream, a size of N must be immediately followed by N
+ * bytes of the actual string, which must be valid UTF-8.
+ *
+ * out must point to at least 3 bytes.
+ *
+ * Returns the number of bytes written, one of {1, 2, 3}.
+ *
+ * It is recommended to use msg_putssz() for arbitrary string sizes. That
+ * function will produce the smallest possible encoding for any size value.
+ */
+int msg_putssz16(unsigned char *out, int sz);
+
+/*
+ * Writes the string size sz in the range [0, 4294967295] to the buffer out.
+ *
+ * In a complete message stream, a size of N must be immediately followed by N
+ * bytes of the actual string, which must be valid UTF-8.
+ *
+ * out must point to at least 5 bytes.
+ *
+ * Returns the number of bytes written, one of {1, 2, 3, 5}.
+ */
+int msg_putssz(unsigned char *out, unsigned int sz);
+
+/*
+ * Writes the binary blob size sz in the range [0, 255] to the buffer out.
+ * Always writes 2 bytes.
+ *
+ * In a complete message stream, a size of N must be immediately followed by
+ * N bytes of the actual data.
+ *
+ * out must point to at least 2 bytes.
+ *
+ * It is recommended to use msg_putbsz() for arbitrary binary blob sizes. That
+ * function will produce the smallest possible encoding for any size value.
+ */
+void msg_putbsz8(unsigned char *out, int sz);
+
+/*
+ * Writes the binary blob size sz in the range [0, 65535] to the buffer out.
+ *
+ * In a complete message stream, a size of N must be immediately followed by
+ * N bytes of the actual data.
+ *
+ * out must point to at least 3 bytes.
+ *
+ * Returns the number of bytes written, one of {2, 3}.
+ *
+ * It is recommended to use msg_putbsz() for arbitrary binary blob sizes. That
+ * function will produce the smallest possible encoding for any size value.
+ */
+int msg_putbsz16(unsigned char *out, int sz);
+
+/*
+ * Writes the binary blob size sz in the range [0, 4294967295] to the buffer
+ * out.
+ *
+ * In a complete message stream, a size of N must be immediately followed by
+ * N bytes of the actual data.
+ *
+ * out must point to at least 5 bytes.
+ *
+ * Returns the number of bytes written, one of {2, 3, 5}.
+ */
+int msg_putbsz(unsigned char *out, unsigned int sz);
+
+/*
+ * Writes the array size sz in the range [0, 15] to the buffer out. Values
+ * outside this range will produce an undefined encoding. Always writes a single
+ * byte.
+ *
+ * In a complete message stream, a size of N must be immediately followed by N
+ * other messages, which form the contents of the array.
+ *
+ * out must point to at least 1 byte.
+ *
+ * It is recommended to use msg_putasz() for arbitrary array sizes. That
+ * function will produce the smallest possible encoding for any size value.
+ */
+void msg_putasz4(unsigned char *out, int sz);
+
+/*
+ * Writes the array size sz in the range [0, 65535] to the buffer out.
+ *
+ * In a complete message stream, a size of N must be immediately followed by N
+ * other messages, which form the contents of the array.
+ *
+ * out must point to at least 3 bytes.
+ *
+ * Returns the number of bytes written, one of {1, 3}.
+ *
+ * It is recommended to use msg_putasz() for arbitrary array sizes. That
+ * function will produce the smallest possible encoding for any size value.
+ */
+int msg_putasz16(unsigned char *out, int sz);
+
+/*
+ * Writes the array size sz in the range [0, 4294967295] to the buffer out.
+ *
+ * In a complete message stream, a size of N must be immediately followed by N
+ * other messages, which form the contents of the array.
+ *
+ * out must point to at least 5 bytes.
+ *
+ * Returns the number of bytes written, one of {1, 3, 5}.
+ */
+int msg_putasz(unsigned char *out, unsigned int sz);
+
+/*
+ * Writes the map size sz in the range [0, 15] to the buffer out. Values
+ * outside this range will produce an undefined encoding. Always writes a single
+ * byte.
+ *
+ * In a complete message stream, a size of N must be immediately followed by
+ * N * 2 other messages, which form the contents of the map as keys followed by
+ * values in alternation.
+ *
+ * out must point to at least 1 byte.
+ *
+ * It is recommended to use msg_putmsz() for arbitrary map sizes. That function
+ * will produce the smallest possible encoding for any size value.
+ */
+void msg_putmsz4(unsigned char *out, int sz);
+
+/*
+ * Writes the map size sz in the range [0, 65535] to the buffer out.
+ *
+ * In a complete message stream, a size of N must be immediately followed by
+ * N * 2 other messages, which form the contents of the map as keys followed by
+ * values in alternation.
+ *
+ * out must point to at least 3 bytes.
+ *
+ * Returns the number of bytes written, one of {1, 3}.
+ *
+ * It is recommended to use msg_putmsz() for arbitrary map sizes. That function
+ * will produce the smallest possible encoding for any size value.
+ */
+int msg_putmsz16(unsigned char *out, int sz);
+
+/*
+ * Writes the map size sz in the range [0, 4294967295] to the buffer out.
+ *
+ * In a complete message stream, a size of N must be immediately followed by
+ * N * 2 other messages, which form the contents of the map as keys followed by
+ * values in alternation.
+ *
+ * out must point to at least 5 bytes.
+ *
+ * Returns the number of bytes written, one of {1, 3, 5}.
+ */
+int msg_putmsz(unsigned char *out, unsigned int sz);
+
+#ifdef __cplusplus
+}
+#endif
+#undef _msg_Bool
+
+#endif
+
+// vi: sw=4 ts=4 noet tw=80 cc=80