Tuesday, January 1, 2019

[System Design]News Feed

Scenario: Design a Twitter news feed. Similar questions, facebook, RSS, etc.

Assume 100M DAU to support, each user post 1 tweet per day and refresh for new update 20 times per day on average. Write QPS = 100M * 1 / 86400 = 1K, peak QPS = 1K * 3 = 3K; Read QPS = 100M * 20 / 86400 = 230K, peak QPS = 230 * 3 = 690K.

Service:

Tweet Service: Store tweets, timeline, news feed
Friend Service: mange following information

Storage:

Two models to get new updates:


Pull:


Tweet table will store following information:

Field Type Detail
tweet_id int64 primary key
owner_id int64
content text
create_at timestamp

Both SQL and NoSql is ok here, if we want to store in Nosql, it will be something like row_key = owner_id, column_key = timestamp + tweet_id, since we need to sort based on timestamp, and value is content.

Friends table will store following information:

Field Type Detail
from_id int64
to_id int64
create_at timestamp

Both SQL and NoSql is ok here, for more details, please check here.

Workflow:

  • News Feed
    • Send request to get all following users of current user, SELECT to_id FROM friend_table WHERE from_id = id
    • For each friend, send request to Tweet Service to get all tweets after certain timestamp(or latest k tweets) and merge all friends' tweets and order them by timestamp
  • Post tweet
    • Insert an entry in tweet table
Push:

Tweet table and Friends table store the same thing.

We will need another news feed table to news feed for each user:

Field Type Detail
id int64 primary key
tweet_id int64
owner_id int64
create_at timestamp

If we want to store in Nosql, we can follow the same as tweet table, row_key = owner_id, column_key = timestamp,  and value is tweet id.

Workflow:


  • News Feed
    • Get all updates in news feed table, SELECT tweet_id FROM newsfeed_table WHERE owner_id = id AND create_at > timestamp ORDERED BY create_at
  • Post tweet
    • Send request to get all followers of current user, SELECT from_id FROM friend_table WHERE to_id = id
    • For each of followers, insert an entry in news feed table
    • This process can be completed asynchronously
Pros and Cons:

Pull:
  • Too slow, since we need to ask DB to get every following user's timeline
  • But we can add cache layer to improve read speed,  we can cache every one's time line(maybe first 100 posts), so most of operations are memory right now
  • We can also cache user's news feed, like first 300 posts, but every time we just return 20 updates, when users scrolls down, we can continue return from cache
Push:
  • Hot spot issue, if some users have really large group of followers, the fanout process will take really long
  • For followers greater than 1M, we can label them as hot spot and they will not push, but other people will pull from them when they refresh, along with their own news feed table to form the complete news feed
  • But this introduces another problem when someone cross the boundary. Assume user A has > 1, followers and he tweets, we won't push for it. After a while, his followers drops to < 1M, and other users won't pull, so his tweet will not read by some users.
  • To solve this, we can have transition layer between hot spot and normal users, if people are in this layer, they will still push, and other users will also pull them, problem solved!
Scale:

We will focus on push model. Tweet table can be sharded by owner_id, friends table sharded by from_id. For cache layer, adopt consistent hashing for key space partition. 

Follow/Unfollow:
  • Once user A follows user B, asynchronously merge B's timeline to A's newsfeed in cache
  • Once user A unfollows user B, the other way around
Likes:

We can store as the table below:
Field Type Detail
id int64 primary key
tweet_id int64
user_id int64
create_at timestamp

For each tweet, if we want to display the data like like_nums, comment_nums, retweet_nums, it will be slow if we calculate those number on the fly by doing something like,  SELECT COUNT(user_id) FROM like_table WHERE tweet_id = id.
We can adopt DB denormalization to store redundant data in tweet table and trade write performance for better read performance, each like action will modify like table and also tweet table:


Field Type Detail
tweet_id int64 primary key
owner_id int64
content text
create_at timestamp
like_nums int64
comment_nums int64
retweet_nums int64


No comments:

Post a Comment