Understanding oplog, replSet, retryWrites
Back story (yapp yapp)
I used to avoid Docker after picking it up prematurely for one of the API projects I built last year. Back then I still didn't understand Dockerfiles, compose files, or the basic terminal commands used to spin things up, but that's no longer the case. Since then I've gained experience with Linux (Ubuntu), which I set up using WSL, and that has made me more fluent with common Linux terminal commands. I've also written YAML configuration files for things like cron jobs in GitHub Actions workflows.
But at my university, the network stability is so poor that I had to find and set up an offline way to be productive — and part of that was Docker for offline database usage. Picking it back up made me realize I've been missing out on using a good ol' lovely and useful tool that could take my skills and work to another level.
One thing I've realized is that Docker lets you manually handle how your server behaves. Nothing is bundled or covered by an external tool — you literally have to set it all up — which is cool because it improves your knowledge of the underlying system and how it operates.
The bug that started all this
I used to use MongoDB Atlas's URI for the database in development — forgive me, it was easier then. Not anymore.
A typical MongoDB URI on Atlas looks like this:
mongodb+srv://username:password@cluster0.<region>.mongodb.net/database_name?retryWrites=true&w=majority
Let's focus on just one option in this connection string, retryWrites=true, which led to this error message:
{
  "success": false,
  "message": "This MongoDB deployment does not support retryable writes. Please add retryWrites=false to your connection string.",
  "stack": "MongoServerError: This MongoDB deployment does not support retryable writes..."
}
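The error message itself suggests one way out: set retryWrites=false in the connection string. Here is a minimal sketch of that change using only Python's standard library. The `set_retry_writes` helper and the localhost URI are hypothetical names made up for illustration, not part of any MongoDB driver.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_retry_writes(uri: str, enabled: bool) -> str:
    """Return a copy of `uri` with the retryWrites option set explicitly."""
    parts = urlsplit(uri)
    # Keep every other option (for example w=majority) and drop any
    # existing retryWrites value so it is not listed twice.
    options = [
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key.lower() != "retrywrites"
    ]
    options.append(("retryWrites", "true" if enabled else "false"))
    return urlunsplit(parts._replace(query=urlencode(options)))


# Hypothetical local URI, assuming a database listening on the default port.
local_uri = "mongodb://localhost:27017/database_name?retryWrites=true&w=majority"
print(set_retry_writes(local_uri, enabled=False))
```

This only silences the error by turning the feature off. It does not make the deployment support retryable writes.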
What is a retryable write?
It's basically a mechanism where, if enabled, the driver automatically retries a write operation once if the first attempt wasn't successful, for example due to a network glitch.
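To make the idea concrete, here is a toy sketch of "try, and on a network glitch try one more time". MongoDB drivers retry a retryable write a single time. Everything named here (`NetworkGlitch`, `send_with_one_retry`, `flaky_insert`) is hypothetical and only imitates the behaviour. It is not how any driver is actually implemented.

```python
class NetworkGlitch(Exception):
    """Hypothetical stand-in for a transient network error."""


def send_with_one_retry(send):
    """Call `send()`; if it fails with a network glitch, try exactly once more."""
    try:
        return send()
    except NetworkGlitch:
        # Second and final attempt. If this one fails too, the error
        # propagates to the caller.
        return send()


attempts = []


def flaky_insert():
    """Pretend insert that fails on the first call and succeeds on the second."""
    attempts.append("attempt")
    if len(attempts) == 1:
        raise NetworkGlitch("connection dropped")
    return {"acknowledged": True}


result = send_with_one_retry(flaky_insert)
print(result, len(attempts))
```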
But retryable writes only work against a replica set (or a sharded cluster), not a standalone server. A replica set is basically a group of servers split into one primary node and one or more secondary nodes.
The primary node is the main server that actively handles write operations from the client, while the secondaries replicate the primary's data by copying and applying the entries in its oplog.
The secondary nodes exist so that one of them can replace the primary if there is a problem with it. The replacement is fast, though it usually takes seconds rather than milliseconds: the remaining members hold an election and, roughly speaking, the winner is a node whose records are the most up to date.
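As a very rough sketch of the "most up-to-date records wins" idea, the snippet below picks the secondary that has copied the most history. This is a deliberate simplification: a real election also involves voting among members and configured priorities, none of which is modelled here. The `Node` class and `pick_new_primary` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_applied: int  # position of the newest oplog entry this node has copied


def pick_new_primary(secondaries):
    """Choose the secondary whose copy of the history is furthest along."""
    return max(secondaries, key=lambda node: node.last_applied)


candidates = [
    Node("secondary-a", last_applied=41),
    Node("secondary-b", last_applied=42),  # has the newest entry
]
winner = pick_new_primary(candidates)
print(winner.name)
```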
The oplog
The oplog is an internal collection that logs write operations in order, including their respective IDs. An example history looks like this:
txn-123 | insert x at time t
txn-456 | update y at t+1
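Because the log is ordered, a secondary can catch up by copying whatever comes after the entries it already has. The sketch below models that with a plain list, using the simplified "id | operation" form from the example above rather than the real oplog entry format. The `Secondary` class and the third write are hypothetical.

```python
# Toy oplog: (id, operation) pairs in the order they were applied.
oplog = [
    ("txn-123", "insert x"),
    ("txn-456", "update y"),
]


class Secondary:
    def __init__(self):
        self.applied = []  # entries copied from the primary so far, in order

    def catch_up(self, primary_oplog):
        """Copy every entry that comes after the ones already applied."""
        new_entries = primary_oplog[len(self.applied):]
        self.applied.extend(new_entries)
        return new_entries


secondary = Secondary()
first_sync = secondary.catch_up(oplog)   # copies both existing entries
oplog.append(("txn-789", "delete z"))    # hypothetical third write on the primary
second_sync = secondary.catch_up(oplog)  # copies only the new entry
print(first_sync)
print(second_sync)
```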
How retryable writes actually work
Here's a concrete example:
- An insert command is sent by the client
- The oplog records the txn history:
txn-456 | update y at t+1
- The primary resolves the command, but a network glitch occurs before it sends a response back to the client
- The client, seeing no response, retries the same request
- If retryWrites is enabled, the primary receives the same command and checks through its IDs whether it has already handled this transaction. It says "wait, I already did this" and immediately sends the response back to the client
Secondary servers just replicate the primary's history as read-only — they're not involved in this retry conversation at all.
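The steps above can be imitated with a toy primary that remembers which IDs it has already handled. This is a sketch under heavy assumptions: the `Primary` class, its `results` dictionary, and the response shape are all hypothetical, and the real server tracks this in more detail than a single dictionary.

```python
class Primary:
    def __init__(self):
        self.oplog = []    # ordered history of writes that were applied
        self.results = {}  # txn id -> response that was already produced

    def handle_write(self, txn_id, operation):
        # "Wait, I already did this": answer from memory, apply nothing.
        if txn_id in self.results:
            return self.results[txn_id]
        # First time this ID is seen: apply the write and record it.
        self.oplog.append((txn_id, operation))
        response = {"ok": 1, "txn": txn_id}
        self.results[txn_id] = response
        return response


primary = Primary()
# First attempt: the write is applied, but imagine the response never
# reaches the client because of a network glitch.
lost_response = primary.handle_write("txn-456", "update y")
# Retry with the same ID: the primary replies without applying it again.
retry_response = primary.handle_write("txn-456", "update y")
print(retry_response, len(primary.oplog))
```

The point of the ID check is that the retry is safe: the write lands in the history once, no matter how many times the same request arrives.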
By now we've covered the oplog, retryable writes, and replica sets.
P.S.
retryWrites would still work if you only had a primary node, as long as that node runs as a single-node replica set rather than a standalone server. But in production, that would mean if your primary server goes down, it's just down. No backup server to replace it.
There is retryWrites in the MongoDB Atlas URI connection string. I didn't pay enough attention to it because I'd been copying and pasting with no error in sight. Not entirely my fault though: I've been getting away with these things because I haven't been configuring them myself.
There's still more to the story of this bug and how I fixed it, so this is part 1.
Writing about this has been really helpful to me, even if no one reads it. It pushes me to research the underlying problem more deeply and clears up my assumptions.
If you enjoy reading 😉 and you’d like to support me, you could buy me a coffee here: send me a smile 😊