PC#13 - Problems Caused by DB Replication Lag

Uber's Migration (Monolithic Postgres to Schemaless) and More...

Hello, this is Saurabh…👋

Welcome to the 101 new subscribers who have joined us since last week.

Also, I’m really glad to inform you that the newsletter crossed 1,000 subscribers this week.

In other words, we are a 1000+ strong community now. It’s a great feeling to be connected with all of you out there and it only motivates me to add more value to the newsletter in the coming months and years.

In this edition, I cover the following topics:

🖥 System Design Concept → Problems Caused By DB Replication Lag

🧰 Case Study → Uber’s Migration (Monolithic Postgres to Schemaless)

🍔 Food For Thought → Stateless Systems & Solving Problems vs Coding

So, let’s dive in.

🖥

System Design Concept

Problems Caused By Replication Lag

Database Replication always has some amount of replication lag.

It’s part of the equation.

We discussed the reasons for Replication Lag in detail in the last post.

Typically, the Replication Lag is small enough to not cause major issues.

However, the bigger the lag gets, the more problems it can potentially create for your application.

Let’s look at a few major problems:

1 - Vanishing Updates

Imagine updating your social media profile with some new details. But when you press save, you don’t see your updates reflected on the page.

That’s horrible! 

Those updates seem to have vanished from the face of the Earth.

And for a moment, you feel the frustration of going through all that effort for nothing.

The same frustration can bite you when you write a super-thoughtful comment on a blog post or a YouTube video. And the next thing you know, it’s gone.

What’s the reason for this mess?

Yes, you guessed it correctly - Replication Lag!

While your updates were successful on the leader node, the subsequent view (or the read) request went to a lagging replica.

The illustration below shows the sequence of events.

  • User A updates the profile on the primary or leader node.

  • The immediate request to read the contents of the profile goes to a replica that has not yet been updated. The user sees stale data and will be unhappy.

  • Eventually, the async replication does its job and the replica catches up.

So - how can systems get around this issue?

By providing read-after-write consistency, also known as read-your-writes consistency.

Basically, it’s a guarantee that if a user makes some updates, they will always see those updates when they reload the page.

This reassures the user that their own input has been saved correctly.

However, this approach doesn’t make any consistency guarantees for other users.

How do you implement this guarantee?

There are several ways to implement the read-after-write guarantee (a small sketch follows this list):

  • When reading something that the user has modified, read it from the leader. Otherwise, read it from a follower. This is ideal for cases such as profile information that can only be edited by the owner of the profile.

  • The client keeps track of the timestamp of its most recent write. The system then ensures that the replica serving any reads for that user reflects all updates up to that timestamp.

  • Monitor the replication lag on followers and prevent queries on any followers or replicas that are more than one or two minutes behind the leader.

  • Some databases (such as DynamoDB) and object stores (such as S3) have started providing stronger consistency guarantees as an option.
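To make the timestamp idea more concrete, here's a minimal sketch that combines it with lag monitoring. Everything here is illustrative: the in-memory dict, the lag values, and the node names are assumptions, not a production design.

```python
import time

# Maps user_id -> timestamp of that user's most recent write (kept by the client
# or a session service; this in-memory dict is just for illustration).
last_write_at = {}

def record_write(user_id):
    """Call this right after a successful write on the leader."""
    last_write_at[user_id] = time.time()

def choose_node_for_read(user_id, leader, replica_lag_seconds):
    """Route a read so the user always sees their own recent writes.

    replica_lag_seconds maps each replica to its current lag behind the leader.
    A replica is safe only if its lag is smaller than the time elapsed since
    the user's last write; if none qualifies, fall back to the leader.
    """
    wrote_at = last_write_at.get(user_id)
    if wrote_at is None:
        # No recent write recorded for this user: any replica is fine.
        return next(iter(replica_lag_seconds))

    elapsed = time.time() - wrote_at
    for replica, lag in replica_lag_seconds.items():
        if lag < elapsed:
            return replica  # this replica has already applied the user's write
    return leader  # no replica has caught up yet

# Usage: right after saving the profile, reads for that user avoid lagging replicas.
record_write("user-42")
print(choose_node_for_read("user-42", "leader", {"replica-1": 0.5, "replica-2": 30.0}))
```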

2 - Going Backward in Time

Replication Lag in databases can take your users backward in time.

And they might not appreciate the sensation.

Here’s an illustration that shows what might happen:

  • User A adds a comment on a blog post. The write operation is successful on the Primary node.

  • The comment is successfully replicated to Replica 1.

  • User B checks the comments on the blog post.

  • The request goes to Replica 1 and User B is able to read the comment made by User A in Step 1.

  • User B refreshes the page for whatever reason but this time, the read request goes to Replica 2 which is lagging.

  • There is no sign of the comment made by User A.

  • After some time, Replica 2 also receives the updates from the Primary but the user experience is ruined.

This wouldn’t sound so bad if the first query hadn’t returned anything.

After all, User B wouldn’t even know that there was a new comment by User A.

However, it’s more confusing for User B because they first saw the comment and then watched it disappear.

So - how to get around this issue?

Your system needs to guarantee Monotonic Reads.

This guarantee means that if one user makes several reads in sequence, they won’t see time go backward.

In other words, they won’t read older data after previously reading newer data.

How do systems manage this?

A simple trick is to make sure that each user always makes their reads from the same replica.

This stickiness can be achieved by choosing the replica based on the hash of the User ID rather than randomly.
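Here's a minimal sketch of that hashing trick, assuming a fixed, illustrative replica pool:

```python
import hashlib

REPLICAS = ["replica-1", "replica-2", "replica-3"]  # assumed replica pool

def replica_for_user(user_id: str) -> str:
    """Always map the same user to the same replica, so consecutive reads
    never jump to a more-lagged node (monotonic reads)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return REPLICAS[int(digest, 16) % len(REPLICAS)]

# Every read for the same user lands on the same replica:
assert replica_for_user("user-b") == replica_for_user("user-b")
```

The catch: if that replica fails or is removed from the pool, the user's reads have to be rerouted to another replica, and the guarantee can briefly break at that moment.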

3 - Violation of Causality

The third problem caused by replication lag is called the violation of causality.

This is a special situation that happens in the case of partitioned or sharded databases.

The below illustration shows this problem:

  • User A writes a message “Hi User B, How Are You?” to the Leader node of Partition 1.

  • User B replies “I’m fine, User A” to the Leader node of Partition 2.

  • The message “I’m fine, User A” gets replicated to the Follower node of Partition 2.

  • The message “Hi User B, How Are You?” gets replicated to the Follower node of Partition 1 but it happens much later than the replication for Partition 2.

  • Now, imagine there’s a user called Viewer who is reading these messages from the follower nodes.

  • The Viewer reads “I’m fine, User A” from Partition 2 (follower). Then, she reads “Hi User B, How Are You?” from Partition 1 (follower).

From the perspective of the Viewer, this is almost like magic.

User B answered the question by User A even before User A asked the question. This is a violation of cause and effect.

To prevent this situation, the system needs to provide another guarantee known as consistent prefix reads.

This guarantee ensures that if a sequence of writes happens in a certain order, then someone reading those writes will see them appear in the same order.
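One common way to get closer to this guarantee in a sharded system is to make sure that causally related writes (for example, all messages in the same conversation) go to the same partition, so one replica applies them in order. Here's a minimal sketch of that routing; the partition names and the conversation-id format are my own assumptions:

```python
import hashlib

PARTITIONS = ["partition-1", "partition-2"]  # assumed shard names

def partition_for(conversation_id: str) -> str:
    """Shard by conversation, not by sender, so both sides of an exchange are
    written to (and replicated from) the same partition, in order."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).hexdigest()
    return PARTITIONS[int(digest, 16) % len(PARTITIONS)]

# User A's question and User B's reply share a conversation id, so they land
# on the same partition and a viewer can never see the reply before the question.
assert partition_for("chat:userA:userB") == partition_for("chat:userA:userB")
```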

🧰

Case Study

Uber’s Migration From Monolithic Postgres to Scalable Data Store

Though this case is a bit old, it has lots of good learnings that are relevant even today.

In 2014, Uber was growing like crazy. Their trips were growing at a whopping rate of almost 20% every month.

This meant that very soon, their trip storage database was going to run out of storage volume and IOPS.

Like most web applications in the initial phase, Uber’s backend system was also monolithic. It consisted of a few app servers and a single database.

There were 3 main parts to this system and each part played a fundamental role:

  • Real-time services

  • Backend services

  • PostgreSQL

While this architecture was sufficient for Uber’s initial operations in a few cities, it couldn’t keep up with Uber’s growth.

Something had to be done to keep things stable from a system point of view.

Since the Trip data took up the largest share of the overall pie, it also accounted for most of the IOPS. Many other Uber services, such as rider and driver support, fraud detection, and suggested pick-ups, also relied on the Trip data.

Therefore, it was decided to refactor the system to move the Trip data to a separate scalable data store.

See the below diagram:

There were a few important requirements for the new Trip store:

  • Horizontally scalable in storage capacity and IOPS

  • High write availability. They were fine trading off short-term read availability.

  • Secondary index support so that trips could be looked up by various parameters such as user or city.

  • No downtime for operations such as expanding storage, taking backups, adding indexes, and so on.

The decision was made to go for a column-oriented, schemaless approach.

What does it mean?

The idea was to organize data as JSON blobs in a grid indexed by trip-UUID.

Horizontal scaling was achieved by partitioning the rows across multiple shards.

Since the database was schemaless, new columns and fields could be added without any reconfiguration. This supported the rapid development culture that Uber wanted to encourage.
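To make that a bit more concrete, here's a rough sketch of the idea. The names, shard count, and layout are my assumptions for illustration, not Uber's actual implementation: each trip is a JSON blob addressed by its UUID, and that UUID also determines which shard the row lives on.

```python
import hashlib
import json
import uuid

NUM_SHARDS = 16  # assumed shard count, purely illustrative

def shard_for(trip_uuid: str) -> int:
    """The trip UUID decides which shard the row lives on."""
    digest = hashlib.md5(trip_uuid.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def store_trip(trip: dict) -> tuple:
    """Serialize the trip as a JSON blob keyed by its UUID.

    Because the value is an opaque JSON blob, new fields can be added to
    `trip` without touching any table definition.
    """
    trip_uuid = trip.get("uuid") or str(uuid.uuid4())
    return shard_for(trip_uuid), trip_uuid, json.dumps(trip)

# A brand-new field ("surge") needs no schema change or reconfiguration:
print(store_trip({"uuid": str(uuid.uuid4()), "city": "SF", "surge": 1.2}))
```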

However, Uber didn’t find a suitable NoSQL database with the above characteristics. There were either operational experience issues or product maturity issues.

Therefore, they decided to build their own sharded data store on top of MySQL.

Here are a few cool characteristics of this data store (the append-only idea is sketched right after the list):

  • Append-only (no updates) data model

  • Buffered writes

  • Sharding

  • Sharded secondary indexes.
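Here's that append-only idea in a tiny, purely illustrative sketch (my own, not Uber's code): instead of updating a cell in place, you always insert a new version and read back the latest one.

```python
from collections import defaultdict

# cells[(row_uuid, column)] is a list of versions, newest last (append-only).
cells = defaultdict(list)

def put(row_uuid: str, column: str, body: dict) -> None:
    """Never update in place; always append a new version of the cell."""
    cells[(row_uuid, column)].append(body)

def get_latest(row_uuid: str, column: str):
    """Reads return the most recently appended version."""
    versions = cells[(row_uuid, column)]
    return versions[-1] if versions else None

put("trip-123", "BASE", {"status": "requested"})
put("trip-123", "BASE", {"status": "completed"})  # a new version, not an overwrite
print(get_latest("trip-123", "BASE"))  # {'status': 'completed'}
```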

It took them roughly 5 months to build this database, and it was initially named Schemaless.

The Migration

It was a complicated migration, considering that crucial parts of the live system had to be switched from a PostgreSQL database to a column-oriented database.

The below architecture was taken as a target:

The lib/tripstore component basically exposed an API that could communicate with a Schemaless-based implementation. This component had a switch so that a query could go either to PostgreSQL or to Schemaless.

This allowed the team to mirror the incoming writes to Schemaless and then replay all the queries to Schemaless for verification.
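Here's a rough sketch of what such a switch might look like. The class, method, and flag names are hypothetical, not Uber's actual lib/tripstore API:

```python
class TripStore:
    """Facade over both stores: mirrors writes, switches reads.

    While `read_from_schemaless` is False, reads keep hitting PostgreSQL,
    but every write is also mirrored to Schemaless so the new store can be
    backfilled and validated before the cut-over.
    """

    def __init__(self, postgres, schemaless, read_from_schemaless=False):
        self.postgres = postgres
        self.schemaless = schemaless
        self.read_from_schemaless = read_from_schemaless

    def save_trip(self, trip):
        self.postgres.save_trip(trip)        # source of truth during migration
        try:
            self.schemaless.save_trip(trip)  # mirrored write, for verification
        except Exception:
            pass  # a failing mirror write must never break the live path

    def get_trip(self, trip_uuid):
        if self.read_from_schemaless:
            return self.schemaless.get_trip(trip_uuid)
        return self.postgres.get_trip(trip_uuid)
```

Flipping the flag per query, or for a small slice of traffic, lets the team compare results from both stores before retiring PostgreSQL.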

A few important phases of the project (mentioned by the Uber team) were as follows:

  • Changing all trip-ids to trip-UUIDs.

  • Building the Column layout for the new trips data model.

  • Backfilling data from PostgreSQL to Schemaless.

  • Implementing mirrored writes to PostgreSQL and Schemaless.

  • Rewriting all queries to work with Schemaless.

  • Lots of validation.

Lessons Learned

There were a lot of lessons learned along this whole journey. And this is something that we, as developers, can also absorb by reading Uber’s case study.

Here are a few major lessons:

  • Always use UUIDs. Starting out with integer IDs can be problematic if you experience large growth, and it’s tedious to undo later on.

  • The column-based approach with sharded indexes is a powerful pattern that gives more control to the programmers in terms of performance.

  • Keep the data layer simple to debug and troubleshoot.

  • The correct data model comes through trial and error.

  • Don’t waste too much time on the final migration. Make it quick because it’s always going to be a moving target.

P.S. This post is inspired by the explanation provided on the Uber Engineering Blog. However, the diagrams have been drawn or re-drawn based on the information shared. You can find the original article over here.

🍔

Food For Thought

👉 What is Stateless Architecture?

Stateless architecture can be a confusing term. At face value, it seems to imply that a system built with such an architecture has no state.

But that’s not the case.

Here’s a post I wrote a few days ago on X (Twitter) where I briefly explained how you should think about stateless systems.

As of this moment, the post has received over 400 likes and over 80 reposts.

Do check it out 👇

👉 Solving the Problem vs. Coding

We often have the tendency to jump into coding everything from scratch.

It’s a classic case of reinventing the wheel every time you need to move from point A to point B.

Here’s a thought-provoking post by Sagar suggesting that we should stop writing code for everything.👇

That’s it for today! ☀️

Enjoyed this issue of the newsletter?

Share with your friends and colleagues

Also, send them over here to subscribe.

In case you want me to cover any specific topic in future editions, please don’t hesitate to fill out the below idea-suggestion form.

See you later with another value-packed edition — Saurabh.
