Tuesday, January 1, 2019

[System Design]Tiny Url System

Scenario: Design Tiny Url Service to support following APIs:


  • Post url/shorten/, body = {url:xxxxx}
    • return a short url 
  • Get /<short_url>
    • return the correct long url it maps to
100M DAU to support, assume each user post 0.1 long url and click 5 short url daily. Write QPS = 0.1 * 100M / 86400 = 115, Read QPS = 5 * 100M / 86400 = 6k.

Storage:

Data Model:
  • Long url can map to multiple short url
  • Short url can only map to one long url
We have different ways to convert long url to short:
  1. sequential id, check here for more details
  2. hash + mod, hash + truncate
Pros and cons:
  1. rely on sequential id, so we have to use SQL DB
  2. conflicts, but given the write QPS is low, so it might still be feasible
Once we have a way to generate short url, we need to choose how to store the data.
For SQL, if we rely on sequential id, simple <id, long url> table is enough, for short url, we calculate the id and find the long url; for long url, if it is there, we return it, otherwise we generate short url and update table.
For NoSql, we need to store to mappings one is <short_url> to <long_url>, the other is <long_url> to <short_url> since NoSql doesn't support multi index well. Whenever we have a short url, we find the mapping in first table, otherwise we go to the second table, if there is no entry in second table, we generate short url and update both table

Scale:

Memcache to speed up read operations, we can store the short_url to long_url mapping in cache, key partition by consistent hashing. For long url to short, we might not need to cache it.

Sharding for NoSql: Since we have two tables, it is easy, for first table we shard by short_url, and long_url for second table
Sharding for SQL:
  • Shard by short url
    • For input short url, we can go to the correct node and return the long url
    • But how about long url? We don't know where it is sharded, may be we can broadcast, but it is not efficient.
    • Given that long url can map to multiple short url, we can still go to a random machine, if it is not there, we generate a new url. Trade space for speed
  • Shard by long url
    • Given long url, we can find correct node.
    • Same problem for short url, we can prepend it with sharding result, another digit at the beginning of short url, assume it was 6 digit AB123C, and let's say the long url sharded to 16th server, we can prepend F at the beginning, so it became FAB123C, and short url now has the sharding information in it
Sharding by region: Distributed cache service can be split by region, since usually US users tend to visit US website and CN users tend to visit CN website, the chance of hit will be higher if we shard by region. We can even shard the database by region.

Custom Url:
  • Need another table to store <custom url, long url> mapping
  • For given short url, we first check custom url table then original url table
  • For given long url, if we want to generate regular short url, we check if it exist in custom url table and original url table, make sure there will not be conflicts between short and custom
  • For given long url, if we want to generate custom url, we check original url table then custom table to make sure it doesn't exist already
 

No comments:

Post a Comment