essay

Why I built a Huffman encoder from scratch

apr 2026 · 12 min · arshan kaudinya

Ifirst learned about Huffman coding in a data structures course. The concept made sense on paper: count symbol frequencies, build a binary tree, assign shorter codes to more frequent symbols. I could walk through the algorithm on a whiteboard, trace the tree by hand, and explain why it produces an optimal prefix-free code. But I never actually felt like I understood it. There is a difference between explaining an algorithm and knowing what it feels like to watch your own code compress real data.

That gap kept nagging at me. I wanted to go beyond compression too. What happens after you shrink a file? How does compressed data actually move across a network? What happens when packets get lost, arrive out of order, or show up late? I wanted to build the whole pipeline: compress the data, break it into packets, simulate an unreliable network, and reassemble everything on the other side. Not a toy demo, but something with real binary protocols, checksums, error handling, and concurrency.

So I built encoder-sim: an educational file transfer system written entirely in Rust. It compresses files using Huffman coding, transmits them through a simulated network with configurable impairments, and reconstructs the original data on the receiving end. The whole thing runs as a five-thread concurrent pipeline connected by bounded channels.

Why Rust

I chose Rust because this project is fundamentally about bits, bytes, and memory layout. I did not want a garbage collector quietly managing things behind the scenes. I wanted to feel every allocation, every ownership transfer, every borrow. Rust’s ownership model and borrow checker force you to think about data lifetimes in a way that other languages let you ignore. When the compiler fights you, it is usually teaching you something about your own design. This project was as much about learning Rust properly as it was about compression.

Building the compression engine

The encoder itself starts with frequency analysis. You scan the input data, count how often each byte value appears, and feed those frequencies into a min-heap to build the Huffman tree. The two lowest-frequency nodes get merged into a parent, and you repeat until a single root remains. From there, you walk the tree to compute code lengths, then convert everything into canonical Huffman codes so the decoder only needs the lengths, not the full tree structure.

The tricky part was the bit-level I/O. Huffman codes are variable length: some are 2 bits, others are 7. You cannot just write bytes. I built a BitWriter that accumulates bits in a buffer and flushes complete bytes, and a BitReader that tracks position down to the individual bit. Getting the MSB-first alignment right, handling the padding on the final byte, making sure the reader and writer agreed on the exact same bit ordering: that took longer than the Huffman tree itself. But when I ran the first round-trip test and the decoded output matched the original byte-for-byte, that was the moment the algorithm stopped being textbook material and became something I actually owned.

Framing, packets, and the hostile network

Each compressed chunk gets wrapped in a binary frame with a 26-byte header that carries everything the receiver needs to validate and reconstruct the data:

—magic bytes to identify the frame format
—chunk ID, raw length, metadata length, and payload length
—a CRC32 checksum so corruption is caught immediately

The frame is then sliced into MTU-sized packets, each with its own header tracking which chunk it belongs to and its position in the sequence. This is where the project stopped being a compression exercise and started feeling like real systems work.

The network simulator sits in the middle of the pipeline and does everything a real network does to make your life harder. It simulates four kinds of impairment:

—latency that delays every packet by a base amount
—jitter that randomizes the delay, so packets sent back-to-back can arrive in reverse order
—packet loss that randomly drops packets entirely
—reordering that shuffles packets within a sliding window

Packets go into a priority queue keyed by their computed delivery time. All randomness is seeded with ChaCha8, so every run is fully deterministic and reproducible. Same seed, same chaos, same results.

Five-thread concurrent pipeline architecture

On the receiving end, the reassembler collects packets and buffers them until all fragments of a chunk have arrived. It handles out-of-order delivery by holding onto later chunks while waiting for earlier ones, and it enforces timeouts so a single lost packet does not stall the entire pipeline. Getting the in-order delivery guarantee right, where chunk 2 waits in a buffer until chunks 0 and 1 have been emitted, was one of those problems that sounds simple until you are debugging it at midnight.

Five threads, one pipeline

The whole system runs as five threads connected by bounded crossbeam channels:

1.chunker splits the input and compresses each chunk with Huffman
2.packetizer fragments each frame into MTU-sized packets
3.network simulator applies latency, jitter, loss, and reordering
4.receiver reassembles packets back into chunk frames in order
5.decoder decompresses and writes the final output

Bounded channels give you natural backpressure: if the receiver falls behind, the network simulator blocks on send, which backs up the packetizer, which backs up the chunker. No thread can overwhelm another. Rust made the concurrency surprisingly manageable. The ownership system prevents data races at compile time, so the bugs I hit were logical ordering issues, not memory corruption.

I structured the project as a Rust workspace with a core library and a thin application binary. The core crate contains all the reusable logic: bit I/O, Huffman codec, framing, packetization, network simulation, reassembly, and metrics. The app crate handles CLI parsing, file I/O, and pipeline orchestration. This split made testing straightforward. Every module has unit tests, and the integration tests run complete compress-transmit-reassemble-decompress round trips.

What building from scratch actually teaches you

The project taught me more about compression, networking, and concurrency than any course did. Not because the courses were bad, but because building forces you to confront every edge case that a lecture can skip over. What happens when two symbols have the same frequency? What if the last byte of compressed data has padding bits? What if a packet arrives for a chunk that already timed out? These are the questions you only discover when you are the one writing the code, and they are exactly the questions that make you actually understand the system.

Check it out: github.com/ArshanKaudinya/Huffman-encoder-sim

← back to writing

← writing