Lockfree ring buffer in Rust
Introduction
While tinkering with a custom Raft implementation, I used channels to share data between modules. I leaned on the Tokio library’s channels, which are blazing fast. But channels are a solved problem, and I hadn’t yet put my own spin on them. So, I decided to roll my own from scratch.
First iteration
A natural starting point for implementing a channel is to wrap an array (static or dynamic) in a lock, such as a Mutex or RwLock, or to use a semaphore to control access. This approach works well in some scenarios, is efficient enough for many use cases, and is straightforward to implement, as shown below:
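Something like this minimal sketch of the idea (the type and method names here are illustrative, not my actual code):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// A minimal locked "channel": a VecDeque guarded by a Mutex, shared through an Arc.
struct LockedChannel<T> {
    queue: Mutex<VecDeque<T>>,
}

impl<T> LockedChannel<T> {
    fn new() -> Arc<Self> {
        Arc::new(LockedChannel {
            queue: Mutex::new(VecDeque::new()),
        })
    }

    fn send(&self, value: T) {
        self.queue.lock().unwrap().push_back(value);
    }

    fn try_recv(&self) -> Option<T> {
        self.queue.lock().unwrap().pop_front()
    }
}
```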
But this isn’t that exciting… The implementation is simple, efficient, and easy to understand, yet it didn’t feel like my own creation. I wanted to build something from scratch, without relying on locks.
Lock-free ring buffer
When you think about performance and arrays, one structure that usually comes to mind is a ring buffer. The idea is simple: you cycle through a fixed-size array, overwriting elements that have been read. You track the read and write positions with a head and a tail.
The Slot is a structure that lets us track whether a slot has been read or written:
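A sketch of what such a slot can look like (the field names are my simplification, not the exact code):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::AtomicUsize;

// One cell of the ring buffer. Slot `i` starts with `sequence == i`;
// a writer bumps it to `i + 1` once a value is stored (ready to read),
// and the reader bumps it by the capacity N once consumed (ready to write again).
struct Slot<T> {
    sequence: AtomicUsize,
    value: UnsafeCell<Option<T>>,
}
```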
A lock-free data structure enables multiple threads to access it concurrently without using locks. Instead, it relies on atomic operations: low-level CPU instructions that execute indivisibly, ensuring thread safety without blocking. Achieving lock-free concurrency doesn’t require a lot of code; it’s easier than it seems. Peek at how Mutex or channels are implemented in Rust’s standard library, and you’ll see they use the same atomic tricks.
Here is how the push method works in my lock-free ring buffer. We first load the current tail and head indices atomically and check that tail hasn’t advanced a full capacity ahead of head (i.e., there are still free slots in the buffer). If so, we attempt to increment tail using compare_exchange_weak to reserve a slot. If the operation fails (due to concurrent modifications), we retry using a spin loop, which avoids blocking and allows other threads to proceed. Once tail is incremented, we compute the array index and access the corresponding slot. We verify the slot’s sequence number matches the expected position, ensuring it’s ready for writing; if not, we spin and retry. Once ready, we write the value to the slot and bump its sequence number to signal completion. The sequence number tracks the slot’s state: adding N signals it’s ready for writing, while adding 1 signals it’s ready for reading. This ensures correct synchronization between producers and the consumer.
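Here is a minimal sketch of a push along those lines, using the Slot from above plus monotonically increasing head and tail counters (a simplified take on the description, not my exact code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

pub struct RingBuffer<T> {
    buffer: Box<[Slot<T>]>,
    head: AtomicUsize, // position of the next read
    tail: AtomicUsize, // position of the next write
}

// Safety: slots are only written after being reserved through the atomic protocol below.
unsafe impl<T: Send> Sync for RingBuffer<T> {}

impl<T> RingBuffer<T> {
    pub fn push(&self, value: T) {
        loop {
            let head = self.head.load(Ordering::Acquire);
            let tail = self.tail.load(Ordering::Acquire);

            // Buffer full: every slot holds an unread (or in-flight) value, so spin.
            if tail - head >= self.buffer.len() {
                std::hint::spin_loop();
                continue;
            }

            // Try to reserve position `tail` for this producer.
            if self
                .tail
                .compare_exchange_weak(tail, tail + 1, Ordering::AcqRel, Ordering::Acquire)
                .is_err()
            {
                // Another producer grabbed it first; retry.
                std::hint::spin_loop();
                continue;
            }

            let slot = &self.buffer[tail % self.buffer.len()];

            // Wait until the slot's sequence matches our position, i.e. the value
            // from the previous lap has been consumed and the slot is writable.
            while slot.sequence.load(Ordering::Acquire) != tail {
                std::hint::spin_loop();
            }

            // Write the value, then bump the sequence by 1 to mark it readable.
            unsafe { *slot.value.get() = Some(value) };
            slot.sequence.store(tail + 1, Ordering::Release);
            return;
        }
    }
}
```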
To build a lock-free data structure, you need to understand atomic operations and memory ordering. The ordering semantics, Acquire and Release, are similar to those used in locks. The Acquire ordering ensures that loads see the latest state modified by other threads. The Release ordering ensures that stores make prior writes visible to other threads. The compare_exchange_weak call with AcqRel combines both semantics to safely update tail while maintaining consistency.
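To make the semantics concrete, here is a tiny stand-alone example (not part of the ring buffer itself): the Release store publishes the data write, and any Acquire load that observes the flag is guaranteed to see that data.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);     // the payload
        READY.store(true, Ordering::Release);  // Release: publishes all writes made before it
    });
    let consumer = thread::spawn(|| {
        while !READY.load(Ordering::Acquire) { // Acquire: seeing `true` means we also see DATA = 42
            std::hint::spin_loop();
        }
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}
```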
You can easily extrapolate and figure out what the read method looks like; a sketch follows below.
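A matching pop under the same assumptions as the push sketch might look like this:

```rust
impl<T> RingBuffer<T> {
    pub fn pop(&self) -> T {
        loop {
            let head = self.head.load(Ordering::Acquire);
            let tail = self.tail.load(Ordering::Acquire);

            // Buffer empty: nothing has been written past `head` yet, so spin.
            if head == tail {
                std::hint::spin_loop();
                continue;
            }

            // Try to claim position `head` for this consumer.
            if self
                .head
                .compare_exchange_weak(head, head + 1, Ordering::AcqRel, Ordering::Acquire)
                .is_err()
            {
                std::hint::spin_loop();
                continue;
            }

            let slot = &self.buffer[head % self.buffer.len()];

            // Wait until the producer has marked this slot readable (position + 1).
            while slot.sequence.load(Ordering::Acquire) != head + 1 {
                std::hint::spin_loop();
            }

            let value = unsafe { (*slot.value.get()).take().unwrap() };
            // Free the slot for the next lap by adding the capacity N.
            slot.sequence.store(head + self.buffer.len(), Ordering::Release);
            return value;
        }
    }
}
```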
Making the structure faster
The next block of code boosted my implementation’s performance by 20%: a struct that aligns its contents to the CPU’s cache-line size, reducing false sharing between threads. False sharing occurs when multiple threads access different variables that happen to share a cache line, causing unnecessary cache invalidation. By padding the indices, we ensure each resides on its own cache line.
/// Pads and aligns a value to the length of a cache line.
// Starting from Intel's Sandy Bridge, spatial prefetcher is now pulling pairs of 64-byte cache
// lines at a time, so we have to align to 128 bytes rather than 64.
//
// Sources:
// - https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
// - https://github.com/facebook/folly/blob/1b5288e6eea6df074758f877c849b6e73bbb9fbb/folly/lang/Align.h#L107
//
// ARM's big.LITTLE architecture has asymmetric cores and "big" cores have 128-byte cache line size.
//
// Sources:
// - https://www.mono-project.com/news/2016/09/12/arm64-icache/
//
// powerpc64 has 128-byte cache line size.
//
// Sources:
// - https://github.com/golang/go/blob/3dd58676054223962cd915bb0934d1f9f489d4d2/src/internal/cpu/cpu_ppc64x.go#L9
// arm, mips, mips64, and riscv64 have 32-byte cache line size.
//
// Sources:
// - https://github.com/golang/go/blob/3dd58676054223962cd915bb0934d1f9f489d4d2/src/internal/cpu/cpu_arm.go#L7
// - https://github.com/golang/go/blob/3dd58676054223962cd915bb0934d1f9f489d4d2/src/internal/cpu/cpu_mips.go#L7
// - https://github.com/golang/go/blob/3dd58676054223962cd915bb0934d1f9f489d4d2/src/internal/cpu/cpu_mipsle.go#L7
// - https://github.com/golang/go/blob/3dd58676054223962cd915bb0934d1f9f489d4d2/src/internal/cpu/cpu_mips64x.go#L9
// - https://github.com/golang/go/blob/3dd58676054223962cd915bb0934d1f9f489d4d2/src/internal/cpu/cpu_riscv64.go#L7
// s390x has 256-byte cache line size.
//
// Sources:
// - https://github.com/golang/go/blob/3dd58676054223962cd915bb0934d1f9f489d4d2/src/internal/cpu/cpu_s390x.go#L7
// x86 and wasm have 64-byte cache line size.
//
// Sources:
// - https://github.com/golang/go/blob/dda2991c2ea0c5914714469c4defc2562a907230/src/internal/cpu/cpu_x86.go#L9
// - https://github.com/golang/go/blob/3dd58676054223962cd915bb0934d1f9f489d4d2/src/internal/cpu/cpu_wasm.go#L7
//
// All others are assumed to have 64-byte cache line size.
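The wrapper itself boils down to an aligned newtype. Here is a sketch modeled on crossbeam’s CachePadded (the real thing selects 32, 64, 128, or 256 bytes per target architecture, as the notes above describe; this sketch hard-codes 128):

```rust
use std::ops::Deref;

// Sketch: fixed 128-byte alignment; real implementations pick the value per target.
#[repr(align(128))]
pub struct CachePadded<T> {
    value: T,
}

impl<T> CachePadded<T> {
    pub const fn new(value: T) -> Self {
        CachePadded { value }
    }
}

impl<T> Deref for CachePadded<T> {
    type Target = T;

    fn deref(&self) -> &T {
        &self.value
    }
}
```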
You’ll find this trick in Rust’s standard library channels and the crossbeam crate. I shamelessly borrowed it from Rust’s source code. Essentially, it makes your indices larger so they never share a cache line. Where is this used? On the indices:
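Roughly like this, reusing the names from the earlier sketches (only the shared counters get padded; the slots themselves are untouched):

```rust
use std::sync::atomic::AtomicUsize;

// Each index lives on its own cache line, so producer and consumer
// updates no longer invalidate each other's cached copies.
pub struct RingBuffer<T> {
    buffer: Box<[Slot<T>]>,
    head: CachePadded<AtomicUsize>,
    tail: CachePadded<AtomicUsize>,
}
```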
Measurements
Alright, let’s get serious for a second; we’re not just throwing code at the wall to see what sticks. To make sure my lock-free ring buffer wasn’t a complete disaster, I benchmarked it. And honestly? It’s pretty darn good. But it’s up against a heavyweight: Rust’s standard library MPSC channel. For comparison, I also tested two naive implementations: a VecDeque wrapped in a Mutex and one wrapped in an RwLock.
The benchmarks were run using the criterion library, which gave me nice, reliable measurements. Tests were conducted on my machine with the following setup:
const CAPACITY: usize = 1024;
const OPERATIONS: usize = 100_000;
The tests cover two scenarios: single-producer single-consumer (SPSC) and multiple-producer single-consumer (MPSC). They’re straightforward: one or more threads write values to the buffer, and another reads them as fast as possible.
SPSC
In the single-producer single-consumer scenario, one thread writes to the buffer while another reads from it.
The average times are:
- My ring buffer: 6 ms
- Mutex: 10 ms
- RwLock: 12 ms
- Standard MPSC channel: 3 ms
Here’s an example of the SPSC test (closure bodies abridged; the constructor name and Arc sharing are assumptions about my actual code):
let buffer = Arc::new(RingBuffer::new(CAPACITY));
let producer_buffer = buffer.clone();
let consumer_buffer = buffer.clone();
let producer = spawn(move || {
    // push OPERATIONS values through producer_buffer
});
let consumer = spawn(move || {
    // pop OPERATIONS values from consumer_buffer
});
producer.join().unwrap();
consumer.join().unwrap();
MPSC
In the multiple-producer single-consumer scenario, two producers write to the buffer while one consumer reads.
The average times are:
- My ring buffer: 14 ms
- Mutex: 26 ms
- RwLock: 28 ms
- Standard MPSC channel: 7 ms
Here’s the test code (closure bodies abridged; the setup mirrors the SPSC test):
let buffer = Arc::new(RingBuffer::new(CAPACITY));
let producer_buffer = buffer.clone();
let producer_buffer2 = producer_buffer.clone();
let consumer_buffer = buffer.clone();
let producer = spawn(move || {
    // push this producer's share of OPERATIONS values through producer_buffer
});
let producer2 = spawn(move || {
    // push the remaining values through producer_buffer2
});
let consumer = spawn(move || {
    // pop OPERATIONS values from consumer_buffer
});
producer.join().unwrap();
producer2.join().unwrap();
consumer.join().unwrap();
Observations
I tried some micro-optimizations, like using bitwise operations instead of modulo:
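For instance, since CAPACITY is a power of two (1024), the slot index can be computed with a mask instead of a remainder:

```rust
// Equivalent to `tail % CAPACITY` because CAPACITY is a power of two.
let index = tail & (CAPACITY - 1);
let slot = &self.buffer[index];
```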
And implementing an exponential backoff retry instead of a plain spin loop (abridged):
let mut backoff = 1;
let push_idx = loop {
    // try to reserve a slot with compare_exchange_weak and break with the index on success;
    // on failure, spin for `backoff` iterations, then double `backoff` and retry
};
But these tweaks didn’t move the needle: my implementation is still about half as fast as the standard library’s MPSC channel.
Conclusion
Here’s the deal: no matter how much you polish micro-optimizations like bitwise indexing or exponential backoff, the real magic happens in the architecture. Rust’s standard MPSC channels and crossbeam lean on linked lists, which seem to dodge the contention bottlenecks my ring buffer hits. Building a lock-free ring buffer was a deep dive into atomics, memory ordering semantics, and cache-line padding. Totally worth it. If you’re curious, grab my code, benchmark it on your rig, and see where you can push it further!