Hassan2bit

Understanding oplog, replSet, retryWrites

Back story (yapp yapp)

I avoided Docker ever since I picked it up prematurely for one of the API projects I built last year. Back then I didn't understand Dockerfiles, compose files, or the basic terminal commands used to spin things up. Not anymore. By now I've had experience with Linux (Ubuntu), which I set up using WSL, and that has made me more fluent with common Linux terminal commands. I've also written configuration files in YAML for things like cron jobs in GitHub Actions workflows.

But at my university, the network is so unreliable that I had to set up an offline way to stay productive, and part of that was Docker for offline database usage. Picking it back up made me realize I'd been missing out on a good ol' lovely and useful tool that could take my skills and work to another level.

One thing I've realized is that Docker lets you manually handle how your server behaves. Nothing is bundled or covered by an external tool — you literally have to set it all up — which is cool because it improves your knowledge of the underlying system and how it operates.

The bug that started all this

I used to use MongoDB Atlas's URI for the database in development — forgive me, it was easier then. Not anymore.

A typical MongoDB URI on Atlas looks like this:

mongodb+srv://username:password@cluster0.<region>.mongodb.net/database_name?retryWrites=true&w=majority

Let's just focus on one setting in this connection string: retryWrites=true. Carried over to my local Docker database, it led to this error message:

{
  "success": false,
  "message": "This MongoDB deployment does not support retryable writes. Please add retryWrites=false to your connection string.",
  "stack": "MongoServerError: This MongoDB deployment does not support retryable writes..."
}
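The error message itself suggests one way out: turn the option off in the connection string. Here is a minimal sketch of that. Note that setRetryWrites is a hypothetical helper I made up for illustration, not part of any MongoDB driver, and the local host, port, and database name are placeholders.

```typescript
// Hypothetical helper (not a driver API): sets the retryWrites option
// in a MongoDB connection string, replacing any existing value.
// Assumes the URI already contains the "/database_name" part.
function setRetryWrites(uri: string, enabled: boolean): string {
  const value = enabled ? "true" : "false";
  const queryStart = uri.indexOf("?");

  // No query string yet: append one.
  if (queryStart === -1) {
    return uri + "?retryWrites=" + value;
  }

  const base = uri.slice(0, queryStart);
  const options = uri
    .slice(queryStart + 1)
    .split("&")
    .filter((pair) => pair.length > 0)
    // Drop any existing retryWrites option so it is not duplicated.
    .filter((pair) => pair.toLowerCase().indexOf("retrywrites=") !== 0);

  options.push("retryWrites=" + value);
  return base + "?" + options.join("&");
}

// Placeholder local URI: host, port, and database name are assumptions.
const localUri = setRetryWrites(
  "mongodb://localhost:27017/database_name?retryWrites=true&w=majority",
  false
);
```

The other way out is to give the local database what the option needs, a replica set, which is what the rest of this post is about.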

What is a retryable write?

It's basically a mechanism where, if enabled, the driver automatically retries certain write operations one time if the first attempt fails, for example due to a network glitch.
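To make the idea concrete, here is a rough sketch of "retry once on a network glitch" from the client side. This is not how any real driver is written. The WriteAttempt type, writeWithOneRetry, and flakyInsert are all names I invented for the illustration.

```typescript
// Outcome of a single write attempt (hypothetical shape, for illustration).
type WriteAttempt<T> =
  | { ok: true; value: T }
  | { ok: false; networkError: boolean; reason: string };

// Runs the write and, only if it failed with a network error,
// runs it exactly one more time.
function writeWithOneRetry<T>(attemptWrite: () => WriteAttempt<T>): WriteAttempt<T> {
  const first = attemptWrite();
  if (first.ok || !first.networkError) {
    return first; // success, or an error that is not worth retrying
  }
  return attemptWrite(); // second and final attempt
}

// Simulated write: the first call hits a glitch, the second succeeds.
let attemptCount = 0;
function flakyInsert(): WriteAttempt<string> {
  attemptCount += 1;
  if (attemptCount === 1) {
    return { ok: false, networkError: true, reason: "connection dropped" };
  }
  return { ok: true, value: "inserted" };
}

const outcome = writeWithOneRetry(flakyInsert);
```

Blindly running the same write twice is only safe if the server can tell the retry apart from a brand-new write, and that is where the replica set comes in.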

But retryable writes only work on a replica set (or a sharded cluster), not on a single standalone server. A replica set is basically a group of servers split into a primary node and secondary nodes.

The primary node is the main server that actively handles write operations from the client, while the secondaries copy the primary's oplog and apply the same operations to their own data.

The secondary nodes exist so that one of them can replace the primary if something goes wrong with it. The replacement is fast, though not instant: with default settings an election typically completes within about 12 seconds. To elect a new primary, the members vote, favoring a candidate whose history is the most up to date.
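Here is a heavily simplified sketch of that election idea. Real MongoDB elections use a voting protocol with terms, priorities, and heartbeats, so treat this only as a mental model. The Member shape, the host names, and the electPrimary function are all assumptions of mine.

```typescript
interface Member {
  host: string;          // hypothetical host names
  healthy: boolean;      // can this member be reached right now?
  lastOplogTime: number; // time of the newest oplog entry this member has
}

// Picks the healthy member with the newest oplog entry, but only if a
// majority of the whole set is reachable to take part in the vote.
function electPrimary(members: Member[]): Member | null {
  const healthy = members.filter((m) => m.healthy);
  const majority = Math.floor(members.length / 2) + 1;
  if (healthy.length < majority) {
    return null; // no majority, so nobody can be elected
  }
  return healthy.reduce((best, m) =>
    m.lastOplogTime > best.lastOplogTime ? m : best
  );
}

// The old primary (mongo1) has gone down; mongo3 is the most up to date.
const replicaSet: Member[] = [
  { host: "mongo1", healthy: false, lastOplogTime: 9 },
  { host: "mongo2", healthy: true, lastOplogTime: 5 },
  { host: "mongo3", healthy: true, lastOplogTime: 7 },
];
const newPrimary = electPrimary(replicaSet);
```

The majority check is also why replica sets are usually run with an odd number of members: with only one of three members reachable, nobody gets elected.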

The oplog

The oplog (operations log) is an internal capped collection, local.oplog.rs, that logs write operations in order. A simplified history, with an ID for each operation, looks like this:

txn-123 | insert x at time t
txn-456 | update y at t+1
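To picture how a secondary uses this log, here is a small sketch that models the history above as an array and works out which entries a secondary still has to apply. The OplogEntry shape is made up and much simpler than real oplog documents, and the numeric times 1 and 2 stand in for t and t+1.

```typescript
// Simplified stand-in for an oplog entry (real entries carry more fields).
interface OplogEntry {
  id: string;
  op: "insert" | "update";
  target: string;
  time: number;
}

// Returns the entries a secondary has not applied yet, given the time
// of the last entry it did apply.
function entriesToReplay(primaryLog: OplogEntry[], lastAppliedTime: number): OplogEntry[] {
  return primaryLog.filter((entry) => entry.time > lastAppliedTime);
}

const primaryLog: OplogEntry[] = [
  { id: "txn-123", op: "insert", target: "x", time: 1 },
  { id: "txn-456", op: "update", target: "y", time: 2 },
];

// This secondary has applied txn-123 only, so it still needs txn-456.
const secondaryLog: OplogEntry[] = primaryLog.slice(0, 1);
const pendingEntries = entriesToReplay(primaryLog, 1);
const caughtUpLog = secondaryLog.concat(pendingEntries);
```

Because the entries are ordered, a secondary only needs to remember where it stopped in order to catch up.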

How retryable writes actually work

Here's a concrete example:

  1. The client sends an insert command, and the driver attaches an ID to it (say, txn-123)
  2. The primary applies the insert and records it in the oplog: txn-123 | insert x at time t
  3. A network glitch occurs before the primary's response reaches the client
  4. Because retryWrites is enabled, the driver sees no response and retries the same command with the same ID
  5. The primary receives the command, checks the ID against the writes it has already handled, says "wait, I already did this," and sends back the original result without applying the insert a second time
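The steps above can be sketched as a tiny simulation of the primary's side. This is only a model: a real server tracks retryable writes per session internally, while SimulatedPrimary, WriteCommand, and the field names below are simplified stand-ins I chose for the example.

```typescript
interface WriteCommand {
  sessionId: string; // identifies the client session (hypothetical value below)
  txnNumber: number; // identifies this particular write within the session
  doc: string;       // the thing being inserted
}

class SimulatedPrimary {
  documents: string[] = [];
  // Responses already sent, keyed by "sessionId:txnNumber".
  private handled: Record<string, string> = {};

  handle(cmd: WriteCommand): string {
    const key = cmd.sessionId + ":" + cmd.txnNumber;
    const previous = this.handled[key];
    if (previous !== undefined) {
      return previous; // "wait, I already did this"
    }
    this.documents.push(cmd.doc); // apply the write exactly once
    const response = "inserted " + cmd.doc;
    this.handled[key] = response;
    return response;
  }
}

const simulatedPrimary = new SimulatedPrimary();
const insertCommand: WriteCommand = { sessionId: "session-1", txnNumber: 123, doc: "x" };

// First attempt: the write is applied, but pretend the response gets lost.
const lostResponse = simulatedPrimary.handle(insertCommand);
// Retry with the same IDs: the primary answers without inserting again.
const retryResponse = simulatedPrimary.handle(insertCommand);
```

After both calls the simulated primary holds a single copy of x, and the retry receives the same response the first attempt would have delivered.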

Secondary servers just replicate the primary's history. They don't take part in this retry conversation at all, unless one of them has been elected as the new primary in the meantime.

By now we've covered the oplog, retryable writes, and replica sets.

