Writing Word2Vec from scratch in Rust
Introduction
I implemented the famous word2vec algorithm from scratch. Why, you may ask? Well, I wanted to create relationships between ideas and concepts in my notes, and using a full-blown LLM seemed overkill (you either need to use one from a third party or run one locally that will consume gigabytes of RAM). Word2vec, on the other hand, is fun, easy to implement, works fairly well, and gave me a solid glimpse into how similarity functions under the hood.
The word2vec algorithm
The main idea is that you have a bunch of words and you try to make sense of them. I visualize it like a gauge chart where the needle swings from one side to the other according to the similarity between two words.
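That needle is usually driven by cosine similarity between two embedding vectors. A minimal sketch (the function name and `f32` slices are my own illustrative choices, not code from this project):

```rust
/// Cosine similarity between two embedding vectors: 1.0 means same
/// direction, 0.0 orthogonal, -1.0 opposite. Illustrative helper.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
```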
The word2vec algorithm comes in two flavors: skip-gram and Continuous Bag of Words (CBOW). Both aim to create word embeddings (vectors that capture the semantic meaning of words) but they approach the problem differently.
Skip-Gram
Given a target word, predict the surrounding context words. It tends to do better on infrequent words, since each target appears in multiple training examples (one per context word).
Continuous Bag-of-Words (CBOW)
Given a window of context words, predict the target word. It averages (or sums) context vectors into one hidden representation before prediction, so it’s faster and smoother on frequent words.
I went with the CBOW model.
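Concretely, CBOW slides a window over the token stream and, at each position, pairs the surrounding words with the centre word. A sketch of that pairing (the window size and function name are illustrative, not from my implementation):

```rust
/// For each position in `tokens`, collect a (context, target) training pair
/// using a symmetric window. Illustrative helper, not the real code.
fn cbow_pairs(tokens: &[usize], window: usize) -> Vec<(Vec<usize>, usize)> {
    let mut pairs = Vec::new();
    for (i, &target) in tokens.iter().enumerate() {
        let start = i.saturating_sub(window);
        let end = (i + window + 1).min(tokens.len());
        // All window positions except the centre form the context.
        let context: Vec<usize> = (start..end).filter(|&j| j != i).map(|j| tokens[j]).collect();
        if !context.is_empty() {
            pairs.push((context, target));
        }
    }
    pairs
}
```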
Implementing CBOW: Tokenization and Setup
The implementation itself is pretty simple. First, you convert your corpus (i.e., your text) into tokens and assign each one a vector. Rather than creating a list of lists, I went with a single massive list containing all the values to avoid wasting memory on unnecessary structures. There are plenty of ways to tokenize a corpus; the original paper uses a Huffman tree for hierarchical softmax, but I opted for a simpler approach: a hashmap, with stopwords removed. Each word becomes a key in the hashmap, with its value being the word’s index.
Next, I used the length of the cleaned corpus to create hidden layers initialized with random values. Randomizing input embeddings ensures each word starts uniquely, so they can learn different directions. The output matrix (the weights that map from the hidden layer to each target word’s score) can start at zero without causing symmetry collapse, because it sits “after” the random input layer. Zero-initialized outputs just “sit quietly” until the first gradient comes in.
(Not the most efficient method, I’ll admit, but it gets the job done and isn’t the bottleneck.)
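Sketched out with the flat `Vec<f32>` layout described earlier (the tiny inline PRNG stands in for the RNG crate, and the scale constants are illustrative):

```rust
// Flat layers: word i's vector lives at [i * dim .. (i + 1) * dim].
// The input layer gets small random values so words start out distinct;
// the output weights can safely start at zero, as discussed above.
fn init_layers(vocab_size: usize, dim: usize) -> (Vec<f32>, Vec<f32>) {
    // Tiny deterministic LCG stands in for a proper RNG crate.
    let mut state: u32 = 42;
    let mut rand = || {
        state = state.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
        (state as f32 / u32::MAX as f32 - 0.5) / dim as f32
    };
    let input_layer: Vec<f32> = (0..vocab_size * dim).map(|_| rand()).collect();
    let hidden_layer = vec![0.0; vocab_size * dim];
    (input_layer, hidden_layer)
}
```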
Training the CBOW Model
Now, onto the algorithm itself. The training process has several key steps. The model iterates over the corpus for a set number of epochs, working with pairs of context words and target words since I’m using CBOW.
```rust
for epoch in 0..cbow_params.epochs {
    // iterate over (context, target) pairs for this epoch
}
```
Step 1: Aggregating Context
The first step involves aggregating the values from the context words into the neuron.
```rust
// Average the context word vectors into the shared neuron `neu1`.
for position in 0..neu1.len() {
    for &word in context {
        neu1[position] += input_layer[word * cbow_params.embeddings_dimension + position];
    }
    neu1[position] /= context.len() as f32;
}
```
Step 2: Positive Sampling
Then, we calculate the sigmoid for the target word (positive sampling) to determine the loss and gradient.
```rust
// Dot product between the aggregated context and the target's output row.
let target_l2 = target * cbow_params.embeddings_dimension;
let f: f32 = neu1
    .iter()
    .zip(&hidden_layer[target_l2..target_l2 + cbow_params.embeddings_dimension])
    .map(|(a, b)| a * b)
    .sum();
let sig = sigmoid(f);
*epoch_loss += -sig.ln();
// Positive label: push the prediction toward 1.
let g = (1.0 - sig) * cbow_params.learning_rate;
```
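The `sigmoid` call above is just the logistic function. A minimal version, with a clamp to keep `exp` from overflowing on extreme inputs (a common trick; I'm paraphrasing my own helper here):

```rust
/// Logistic function, clamped so exp never overflows for extreme inputs.
fn sigmoid(x: f32) -> f32 {
    let x = x.clamp(-10.0, 10.0);
    1.0 / (1.0 + (-x).exp())
}
```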
Step 3: Updating Layers
Next, we update the hidden layer and the second neuron using the gradient.
```rust
for c in 0..neu1e.len() {
    // Accumulate the error for the input layer, then update the output row.
    neu1e[c] += g * hidden_layer[target_l2 + c];
    hidden_layer[target_l2 + c] += g * neu1[c];
}
```
This process trains the model by signaling, “Hey, these are the expected values for this target,” moving related words closer together in the vector space.
Step 4: Negative Sampling
The next step flips this idea with negative sampling. Instead of focusing on context (related words), we pick words that aren’t near the target using random samples from the vocabulary.
```rust
// Draw random vocabulary indices, skipping the target itself.
// (The exact counts were elided; `negative_samples` is a stand-in name.)
let negative_samples = vocab_indices
    .choose_multiple(&mut rng, cbow_params.negative_samples + 1)
    .filter(|&&idx| idx != target)
    .take(cbow_params.negative_samples);
```
We apply nearly the same operations as before to these negative samples.
```rust
for negative_target in negative_samples {
    // Same dot product and sigmoid as the positive step, but with label 0.
}
```
Here’s the key difference: with context words, the gradient is `(1.0 - sig) * cbow_params.learning_rate`, but with negative samples, it’s `(0.0 - sig) * cbow_params.learning_rate`. This pushes unrelated words further apart, rather than pulling them closer.
Step 5: Backpropagation
Finally, we update the input layer, passing all the new knowledge back through backpropagation.
```rust
// Propagate the accumulated error back to each context word's embedding.
context.iter().for_each(|&word| {
    let l1 = word * cbow_params.embeddings_dimension;
    for c in 0..neu1e.len() {
        input_layer[l1 + c] += neu1e[c];
    }
});
```
To sum it up, we update the layers positively with context words, repeat the process negatively with unrelated words, and then propagate the changes back to the input layer. That’s the core of it, simple yet effective. You can tweak tons of variables to boost learning, like a dynamic learning rate, the number of random samples, embedding sizes, smarter sample selection, or better tokenization strategies.
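For instance, a dynamic learning rate can be as simple as the linear decay the original C implementation uses; a sketch (treat the floor constant as illustrative):

```rust
/// Linearly decay the learning rate as training progresses, never dropping
/// below a small floor. The 0.0001 floor echoes the original C code's style,
/// but the exact constants here are illustrative.
fn decayed_lr(initial_lr: f32, words_seen: usize, total_words: usize) -> f32 {
    let progress = words_seen as f32 / total_words as f32;
    (initial_lr * (1.0 - progress)).max(initial_lr * 0.0001)
}
```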
Making it blazingly fast
My focus, though, was on execution speed. I used Criterion to benchmark and flamegraphs to pinpoint what was slowing my code down.
SIMD Optimizations
One optimization was leveraging SIMD (Single Instruction, Multiple Data). In Rust, you don’t need to handcraft SIMD (though you could for bigger gains); you can instead write loops the compiler recognizes as worth vectorizing. Using cargo-asm, I verified that SIMD instructions were emitted, like:
```asm
mulps xmm3, xmm1
```
or
```asm
addps xmm3, xmm1
```
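The kind of loop that reliably autovectorizes is a plain elementwise pass over equal-length slices with a simple reduction; slicing both sides to the same length up front helps the compiler drop bounds checks and emit packed instructions. A general-purpose example (this exact helper is mine, not from the project):

```rust
/// Dot product written so the compiler can vectorize it: equal-length
/// slices, iterator-based access, no manual indexing.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    a[..n].iter().zip(&b[..n]).map(|(x, y)| x * y).sum()
}
```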
Reducing Bound Checks
Another tweak was reducing bound checks while keeping the code mostly safe. Instead of iterating over indices, I worked directly with the vectors:
```rust
for c in 0..cbow_params.embeddings_dimension {
    neu1e[c] += g * hidden_layer[target_l2 + c];
}
```
to
```rust
for (c, value) in neu1e.iter_mut().enumerate() {
    *value += g * hidden_layer[target_l2 + c];
}
```
This lets the compiler handle bounds more efficiently.
(Flamegraph: the section that currently takes the most time.)

(Criterion plot: average time per iteration for this benchmark. The shaded region shows the estimated probability of an iteration taking a certain amount of time, while the line shows the mean.)
Fun fact: combining the addition and multiplication into one fused instruction sounded smart, but it actually doubled the runtime in the benchmark.
Parallelization with Rayon
Another optimization is, obviously, making it parallel. You know when I said I built this from scratch? Well, I stretched the truth a bit. I used a few libraries: one for the random number generator, another for a list of stop words (yep, I leaned on a library for that), and finally, Rayon to parallelize the iterations. I could’ve rolled my own parallelization, but Rayon’s ergonomics make it a breeze to run things in parallel.

Initially, my code was single-threaded, kinda slow, not gonna lie. So, I added parallelization, and boom, new problems. In Rust, you’ve got two straightforward ways to handle shared mutable data: wrap it in an `Arc<Mutex<T>>` and pay for the locking, or share raw pointers in `unsafe` code and accept the data races yourself (the classic Hogwild-style trade-off for word2vec). I went with the latter:

```rust
// Share the layers across threads as raw pointers wrapped in a small
// Send + Sync newtype. The parallel range was elided in my notes, so
// `num_chunks` is a stand-in; each worker trains on its own slice of
// the corpus, writing straight through the raw pointers.
unsafe {
    let input_ptr = UnsafePtr(input_layer.as_mut_ptr());
    let hidden_ptr = UnsafePtr(hidden_layer.as_mut_ptr());
    let input_len = input_layer.len();
    let hidden_len = hidden_layer.len();
    (0..num_chunks).into_par_iter().for_each(|chunk| {
        // train on this chunk of the corpus
    });
}
```

With a little parallelism, I managed to cut the execution time by 4x.

Results and Reflections

With these changes, my word2vec is screaming fast: 7 minutes faster than the original C implementation, which takes 22 minutes for the text8 dataset over 100 epochs. That’s a zippy 15 minutes, folks! Not bad, right? There’s always room to tweak further: better tokenization, smarter negative sampling, or dynamic learning rates. But for now, I’m stoked with a fast, functional word2vec that makes my notes way smarter.

And there you have it, my adventure building word2vec from scratch in Rust! You can find the repo here if you want to try it out.