今際の国の呵呵君: [System Design]News Feed

Scenario: Design a Twitter news feed. Similar questions, facebook, RSS, etc.

Assume 100M DAU to support, each user post 1 tweet per day and refresh for new update 20 times per day on average. Write QPS = 100M * 1 / 86400 = 1K, peak QPS = 1K * 3 = 3K; Read QPS = 100M * 20 / 86400 = 230K, peak QPS = 230 * 3 = 690K.

Service:

Tweet Service: Store tweets, timeline, news feed
Friend Service: mange following information

Storage:

Two models to get new updates:

Pull:

Tweet table will store following information:

Field	Type	Detail
tweet_id	int64	primary key
owner_id	int64
content	text
create_at	timestamp

Both SQL and NoSql is ok here, if we want to store in Nosql, it will be something like row_key = owner_id, column_key = timestamp + tweet_id, since we need to sort based on timestamp, and value is content.

Friends table will store following information:

Field	Type	Detail
from_id	int64
to_id	int64
create_at	timestamp

Both SQL and NoSql is ok here, for more details, please check here.

Workflow:

News Feed

Send request to get all following users of current user, SELECT to_id FROM friend_table WHERE from_id = id
For each friend, send request to Tweet Service to get all tweets after certain timestamp(or latest k tweets) and merge all friends' tweets and order them by timestamp

Post tweet

Insert an entry in tweet table

Push:

Tweet table and Friends table store the same thing.

We will need another news feed table to news feed for each user:

Field	Type	Detail
id	int64	primary key
tweet_id	int64
owner_id	int64
create_at	timestamp

If we want to store in Nosql, we can follow the same as tweet table, row_key = owner_id, column_key = timestamp, and value is tweet id.

Workflow:

News Feed

Get all updates in news feed table, SELECT tweet_id FROM newsfeed_table WHERE owner_id = id AND create_at > timestamp ORDERED BY create_at

Post tweet

Send request to get all followers of current user, SELECT from_id FROM friend_table WHERE to_id = id
For each of followers, insert an entry in news feed table
This process can be completed asynchronously

Pros and Cons:

Pull:

Too slow, since we need to ask DB to get every following user's timeline
But we can add cache layer to improve read speed, we can cache every one's time line(maybe first 100 posts), so most of operations are memory right now
We can also cache user's news feed, like first 300 posts, but every time we just return 20 updates, when users scrolls down, we can continue return from cache

Push:

Hot spot issue, if some users have really large group of followers, the fanout process will take really long
For followers greater than 1M, we can label them as hot spot and they will not push, but other people will pull from them when they refresh, along with their own news feed table to form the complete news feed
But this introduces another problem when someone cross the boundary. Assume user A has > 1, followers and he tweets, we won't push for it. After a while, his followers drops to < 1M, and other users won't pull, so his tweet will not read by some users.
To solve this, we can have transition layer between hot spot and normal users, if people are in this layer, they will still push, and other users will also pull them, problem solved!

Scale:

We will focus on push model. Tweet table can be sharded by owner_id, friends table sharded by from_id. For cache layer, adopt consistent hashing for key space partition.

Follow/Unfollow:

Once user A follows user B, asynchronously merge B's timeline to A's newsfeed in cache
Once user A unfollows user B, the other way around

Likes:

We can store as the table below:

Field	Type	Detail
id	int64	primary key
tweet_id	int64
user_id	int64
create_at	timestamp

For each tweet, if we want to display the data like like_nums, comment_nums, retweet_nums, it will be slow if we calculate those number on the fly by doing something like, SELECT COUNT(user_id) FROM like_table WHERE tweet_id = id.
We can adopt DB denormalization to store redundant data in tweet table and trade write performance for better read performance, each like action will modify like table and also tweet table:

Field	Type	Detail
tweet_id	int64	primary key
owner_id	int64
content	text
create_at	timestamp
like_nums	int64
comment_nums	int64
retweet_nums	int64

今際の国の呵呵君

Tuesday, January 1, 2019

[System Design]News Feed

No comments:

Post a Comment