Writing Word2Vec from scratch in Rust
Introduction
I implemented the famous word2vec algorithm from scratch. Why, you may ask? Well, I wanted to create relationships between ideas and concepts in my notes, and using a full-blown LLM seemed overkill (you either need to use one from a third party or run one locally that will consume gigabytes of RAM). Word2vec, on the other hand, is fun, easy to implement, works fairly well, and gave me a solid glimpse into how similarity functions under the hood.
The word2vec algorithm
The main idea is that you have a bunch of words and you try to make sense of them. I visualize it like a gauge chart where the needle swings from one side to the other according to the similarity between two words.
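That needle is usually driven by cosine similarity between two embedding vectors. A minimal sketch (the function name and `f32` slices are my own illustrative choices, not code from this project):

```rust
/// Cosine similarity between two embedding vectors: 1.0 means same
/// direction, 0.0 orthogonal, -1.0 opposite. Illustrative helper.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
```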
The word2vec algorithm comes in two flavors: skip-gram and Continuous Bag of Words (CBOW). Both aim to create word embeddings (vectors that capture the semantic meaning of words) but they approach the problem differently.
Skip-Gram
Given a target word, predict the surrounding context words. It tends to do better on infrequent words, since each target appears in multiple training examples (one per context word).
Continuous Bag-of-Words (CBOW)
Given a window of context words, predict the target word. It averages (or sums) context vectors into one hidden representation before prediction, so it’s faster and smoother on frequent words.
I went with the CBOW model.
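Concretely, CBOW slides a window over the token stream and, at each position, pairs the surrounding words with the centre word. A sketch of that pairing (the window size and function name are illustrative, not from my implementation):

```rust
/// For each position in `tokens`, collect a (context, target) training pair
/// using a symmetric window. Illustrative helper, not the real code.
fn cbow_pairs(tokens: &[usize], window: usize) -> Vec<(Vec<usize>, usize)> {
    let mut pairs = Vec::new();
    for (i, &target) in tokens.iter().enumerate() {
        let start = i.saturating_sub(window);
        let end = (i + window + 1).min(tokens.len());
        // All window positions except the centre form the context.
        let context: Vec<usize> = (start..end).filter(|&j| j != i).map(|j| tokens[j]).collect();
        if !context.is_empty() {
            pairs.push((context, target));
        }
    }
    pairs
}
```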
Implementing CBOW: Tokenization and Setup
The implementation itself is pretty simple. First, you convert your corpus (i.e., your text) into tokens and assign each one a vector. Rather than creating a list of lists, I went with a single massive list containing all the values to avoid wasting memory on unnecessary structures. There are plenty of ways to tokenize a corpus; the original paper uses a Huffman tree for hierarchical softmax, but I opted for a simpler approach: a hashmap, with stopwords removed. Each word becomes a key in the hashmap, with its value being the word’s index.
Next, I used the length of the cleaned corpus to create hidden layers initialized with random values. Randomizing input embeddings ensures each word starts uniquely, so they can learn different directions. The output matrix (the weights that map from the hidden layer to each target word’s score) can start at zero without causing symmetry collapse, because it sits “after” the random input layer. Zero-initialized outputs just “sit quietly” until the first gradient comes in.
(Not the most efficient method, I’ll admit, but it gets the job done and isn’t the bottleneck.)
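Sketched out with the flat `Vec<f32>` layout described earlier (the tiny inline PRNG stands in for the RNG crate, and the scale constants are illustrative):

```rust
// Flat layers: word i's vector lives at [i * dim .. (i + 1) * dim].
// The input layer gets small random values so words start out distinct;
// the output weights can safely start at zero, as discussed above.
fn init_layers(vocab_size: usize, dim: usize) -> (Vec<f32>, Vec<f32>) {
    // Tiny deterministic LCG stands in for a proper RNG crate.
    let mut state: u32 = 42;
    let mut rand = || {
        state = state.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
        (state as f32 / u32::MAX as f32 - 0.5) / dim as f32
    };
    let input_layer: Vec<f32> = (0..vocab_size * dim).map(|_| rand()).collect();
    let hidden_layer = vec![0.0; vocab_size * dim];
    (input_layer, hidden_layer)
}
```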
Training the CBOW Model
Now, onto the algorithm itself. The training process has several key steps. The model iterates over the corpus for a set number of epochs, working with pairs of context words and target words since I’m using CBOW.
```rust
for epoch in 0..cbow_params.epochs {
    // iterate over (context, target) pairs for this epoch
}
```
Step 1: Aggregating Context
The first step involves aggregating the values from the context words into the neuron.
```rust
// Average the context word vectors into the shared neuron `neu1`.
for position in 0..neu1.len() {
    for &word in context {
        neu1[position] += input_layer[word * cbow_params.embeddings_dimension + position];
    }
    neu1[position] /= context.len() as f32;
}
```
Step 2: Positive Sampling
Then, we calculate the sigmoid for the target word (positive sampling) to determine the loss and gradient.
```rust
// Dot product between the aggregated context and the target's output row.
let target_l2 = target * cbow_params.embeddings_dimension;
let f: f32 = neu1
    .iter()
    .zip(&hidden_layer[target_l2..target_l2 + cbow_params.embeddings_dimension])
    .map(|(a, b)| a * b)
    .sum();
let sig = sigmoid(f);
*epoch_loss += -sig.ln();
// Positive label: push the prediction toward 1.
let g = (1.0 - sig) * cbow_params.learning_rate;
```
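The `sigmoid` call above is just the logistic function. A minimal version, with a clamp to keep `exp` from overflowing on extreme inputs (a common trick; I'm paraphrasing my own helper here):

```rust
/// Logistic function, clamped so exp never overflows for extreme inputs.
fn sigmoid(x: f32) -> f32 {
    let x = x.clamp(-10.0, 10.0);
    1.0 / (1.0 + (-x).exp())
}
```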
Step 3: Updating Layers
Next, we update the hidden layer and the second neuron using the gradient.
```rust
for c in 0..neu1e.len() {
    // Accumulate the error for the input layer, then update the output row.
    neu1e[c] += g * hidden_layer[target_l2 + c];
    hidden_layer[target_l2 + c] += g * neu1[c];
}
```
This process trains the model by signaling, “Hey, these are the expected values for this target,” moving related words closer together in the vector space.
Step 4: Negative Sampling
The next step flips this idea with negative sampling. Instead of focusing on context (related words), we pick words that aren’t near the target using random samples from the vocabulary.
```rust
// Draw random vocabulary indices, skipping the target itself.
// (The exact counts were elided; `negative_samples` is a stand-in name.)
let negative_samples = vocab_indices
    .choose_multiple(&mut rng, cbow_params.negative_samples + 1)
    .filter(|&&idx| idx != target)
    .take(cbow_params.negative_samples);
```
We apply nearly the same operations as before to these negative samples.
```rust
for negative_target in negative_samples {
    // Same dot product and sigmoid as the positive step, but with label 0.
}
```
Here’s the key difference: with context words, the gradient is `(1.0 - sig) * cbow_params.learning_rate`, but with negative samples, it’s `(0.0 - sig) * cbow_params.learning_rate`. This pushes unrelated words further apart, rather than pulling them closer.
Step 5: Backpropagation
Finally, we update the input layer, passing all the new knowledge back through backpropagation.
```rust
// Propagate the accumulated error back to each context word's embedding.
context.iter().for_each(|&word| {
    let l1 = word * cbow_params.embeddings_dimension;
    for c in 0..neu1e.len() {
        input_layer[l1 + c] += neu1e[c];
    }
});
```
To sum it up, we update the layers positively with context words, repeat the process negatively with unrelated words, and then propagate the changes back to the input layer. That’s the core of it, simple yet effective. You can tweak tons of variables to boost learning, like a dynamic learning rate, the number of random samples, embedding sizes, smarter sample selection, or better tokenization strategies.
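For instance, a dynamic learning rate can be as simple as the linear decay the original C implementation uses; a sketch (treat the floor constant as illustrative):

```rust
/// Linearly decay the learning rate as training progresses, never dropping
/// below a small floor. The 0.0001 floor echoes the original C code's style,
/// but the exact constants here are illustrative.
fn decayed_lr(initial_lr: f32, words_seen: usize, total_words: usize) -> f32 {
    let progress = words_seen as f32 / total_words as f32;
    (initial_lr * (1.0 - progress)).max(initial_lr * 0.0001)
}
```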
Making it blazingly fast
My focus, though, was on execution speed. I used Criterion to benchmark and flamegraphs to pinpoint what was slowing my code down.
SIMD Optimizations
One optimization was leveraging SIMD (Single Instruction, Multiple Data). In Rust, you don’t need to handcraft SIMD (though you could for bigger gains); you can instead write loops the compiler recognizes as worth vectorizing. Using cargo-asm, I verified that SIMD instructions were emitted, like:
```asm
mulps xmm3, xmm1
```
or
```asm
addps xmm3, xmm1
```
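The kind of loop that reliably autovectorizes is a plain elementwise pass over equal-length slices with a simple reduction; slicing both sides to the same length up front helps the compiler drop bounds checks and emit packed instructions. A general-purpose example (this exact helper is mine, not from the project):

```rust
/// Dot product written so the compiler can vectorize it: equal-length
/// slices, iterator-based access, no manual indexing.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    a[..n].iter().zip(&b[..n]).map(|(x, y)| x * y).sum()
}
```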
Reducing Bound Checks
Another tweak was reducing bound checks while keeping the code mostly safe. Instead of iterating over indices, I worked directly with the vectors:
```rust
for c in 0..cbow_params.embeddings_dimension {
    neu1e[c] += g * hidden_layer[target_l2 + c];
}
```
to
```rust
for (c, value) in neu1e.iter_mut().enumerate() {
    *value += g * hidden_layer[target_l2 + c];
}
```
This lets the compiler handle bounds more efficiently.
(Flamegraph: the section that currently takes the most time.)

(Criterion plot: average time per iteration for this benchmark. The shaded region shows the estimated probability of an iteration taking a certain amount of time, while the line shows the mean.)
Fun fact: combining the addition and multiplication into one fused instruction sounded smart, but it actually doubled the runtime in the benchmark.
Parallelization with Rayon
Another optimization is, obviously, making it parallel. You know when I said I built this from scratch? Well, I stretched the truth a bit. I used a few libraries: one for the random number generator, another for a list of stop words (yep, I leaned on a library for that), and finally, Rayon to parallelize the iterations. I could’ve rolled my own parallelization, but Rayon’s ergonomics make it a breeze to run things in parallel.

Initially, my code was single-threaded, kinda slow, not gonna lie. So, I added parallelization, and boom, new problems. In Rust, you’ve got two straightforward ways to handle shared mutable data: wrap it in an `Arc<Mutex<T>>` and pay for the locking, or share raw pointers in `unsafe` code and accept the data races yourself (the classic Hogwild-style trade-off for word2vec). I went with the latter:

```rust
// Share the layers across threads as raw pointers wrapped in a small
// Send + Sync newtype. The parallel range was elided in my notes, so
// `num_chunks` is a stand-in; each worker trains on its own slice of
// the corpus, writing straight through the raw pointers.
unsafe {
    let input_ptr = UnsafePtr(input_layer.as_mut_ptr());
    let hidden_ptr = UnsafePtr(hidden_layer.as_mut_ptr());
    let input_len = input_layer.len();
    let hidden_len = hidden_layer.len();
    (0..num_chunks).into_par_iter().for_each(|chunk| {
        // train on this chunk of the corpus
    });
}
```

With a little parallelism, I managed to cut the execution time by 4x.

Results and Reflections

With these changes, my word2vec is screaming fast: 7 minutes faster than the original C implementation, which takes 22 minutes for the text8 dataset over 100 epochs. That’s a zippy 15 minutes, folks! Not bad, right? There’s always room to tweak further: better tokenization, smarter negative sampling, or dynamic learning rates. But for now, I’m stoked with a fast, functional word2vec that makes my notes way smarter.

And there you have it, my adventure building word2vec from scratch in Rust! You can find the repo here if you want to try it out.