Scenario: Design a typeahead service which supports the following functionalities:
- Be able to collect data and keep track of the sentences users have typed recently
- For each prefix the user types, return the top k hot sentences that start with it
Services:
Data Collection Service: collects the data from users and does the preprocessing to get the frequency of each sentence.
Query Service: stores the hot sentences for every prefix and returns the top k hot sentences for a given query request. Let's assume there are 1B active users, each of them does 10 searches daily, and the average sentence length is 10 (every time the user types a character we may need to search for the new prefix). QPS = 1B * 10 * 10 / 86400 ≈ 1M, peak QPS = 2 * 1M = 2M. It is a pretty read-heavy service.
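A minimal sketch of the back-of-envelope estimate above; the numbers (1B users, 10 searches/day, 10 characters per sentence) are the assumptions stated in the text, not measurements:

```python
DAU = 1_000_000_000      # daily active users (assumed)
SEARCHES_PER_USER = 10   # searches per user per day (assumed)
AVG_SENTENCE_LEN = 10    # each typed character triggers one prefix lookup
SECONDS_PER_DAY = 86_400

qps = DAU * SEARCHES_PER_USER * AVG_SENTENCE_LEN / SECONDS_PER_DAY
peak_qps = 2 * qps

print(f"average QPS ~ {qps:,.0f}")   # ~1.16M, rounded to 1M in the text
print(f"peak QPS    ~ {peak_qps:,.0f}")
```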
Storage:
Where do we get the data?
We can get the data by logging: each time a user types a full sentence and sends a request, we log it to disk as <user_id, sentence, timestamp>. The logs can be stored in a distributed file system like GFS, since the amount of newly generated data each day is really large. Using the estimates above and assuming each log entry is 20 bytes, the data generated every day will be 1B * 10 * 20 bytes = 200 GB. And if we want to keep the data for the past two weeks, it will be 200 GB * 14 = 2.8 TB.
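A hypothetical sketch of the logging step (function and file names are illustrative only): append one <user_id, sentence, timestamp> record per submitted search to a local log file, which is later shipped to the distributed file system.

```python
import json
import time

LOG_PATH = "search.log"  # local buffer; shipped to GFS/HDFS later (assumed)

def log_search(user_id: str, sentence: str) -> None:
    # ~20 bytes/entry * 1B users * 10 searches/day ≈ 200 GB/day of raw logs
    entry = {"user_id": user_id, "sentence": sentence, "ts": int(time.time())}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")
```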
After we have the log data, we can run a MapReduce job to get the <sentence, frequency> pairs we want. Since they are simple key-value pairs, we can store them in a distributed NoSQL database like BigTable.
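A minimal MapReduce-style sketch of that aggregation (not tied to any specific framework's API); `run_local` just simulates the shuffle and reduce phases on one machine:

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

def mapper(log_entry: dict) -> Iterator[Tuple[str, int]]:
    # Emit each logged sentence with count 1.
    yield log_entry["sentence"], 1

def reducer(sentence: str, counts: Iterable[int]) -> Tuple[str, int]:
    # Sum all counts for one sentence.
    return sentence, sum(counts)

def run_local(log_entries: Iterable[dict]) -> Dict[str, int]:
    grouped = defaultdict(list)
    for entry in log_entries:
        for sentence, count in mapper(entry):
            grouped[sentence].append(count)
    return dict(reducer(s, c) for s, c in grouped.items())
```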
It will still be pretty expensive to store this large amount of data. Actually, we don't need the service to be that accurate; we just want to know the general trend. This is why we bring in probabilistic logging: every time, we generate a random number in [0, 1000) and only log the data if the number equals 0, so overall we are logging with 1/1000 probability. The data generated every day can then be reduced from 200 GB to 200 MB.
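A sketch of that sampling scheme; `log_fn` stands for whatever actually writes the log entry (e.g. the log_search sketch above) and is an assumption for illustration:

```python
import random
from typing import Callable

SAMPLE_RATE = 1000  # keep roughly 1 out of every 1000 requests

def maybe_log(user_id: str, sentence: str,
              log_fn: Callable[[str, str], None]) -> None:
    # Log only when the random draw in [0, 1000) hits 0.
    if random.randrange(SAMPLE_RATE) == 0:
        log_fn(user_id, sentence)
```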
How do we store the data?
For the Data Collection Service, it is a simple <sentence, frequency> pair.
For the Query Service, it is a <prefix, list_of_sentences> pair; this table can easily be constructed from the table above (see the sketch below).
Both are essentially key-value pairs, so a NoSQL database is a good fit.
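A sketch of building the Query Service table from the Data Collection Service table; keeping only the top k hottest sentences per prefix is an assumption that matches what the query path needs:

```python
from collections import defaultdict
from typing import Dict, List

def build_prefix_table(sentence_freq: Dict[str, int], k: int = 10) -> Dict[str, List[str]]:
    # Expand every sentence into all of its prefixes.
    prefix_to_pairs = defaultdict(list)
    for sentence, freq in sentence_freq.items():
        for i in range(1, len(sentence) + 1):
            prefix_to_pairs[sentence[:i]].append((freq, sentence))
    # Keep only the k hottest sentences per prefix.
    return {
        prefix: [s for _, s in sorted(pairs, reverse=True)[:k]]
        for prefix, pairs in prefix_to_pairs.items()
    }
```

Precomputing the top k per prefix this way keeps the query path to a single key lookup.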
Scale:
Each table can be sharded by sentence and by prefix respectively; since the data is big, we need to partition it into different ranges and store them on different machines.
Since the QPS is pretty high, we will also need an extra cache layer. We can use the general cache mechanism to store the <prefix, list_of_sentences> pairs. A better way is to store a trie if we are able to load all the data into memory, as in this problem; we can also persist the trie on disk in case we lose the cache. The cache layer can be sharded by key, so the trie becomes a distributed trie, and each request is routed to the correct server to handle it. If we can't load all the data into memory and need an eviction strategy, it is better to adopt the general cache mechanism, since that kind of eviction is hard to do in a trie.
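A minimal sketch of one shard's in-memory trie, assuming the shard fits in memory: each node keeps the top k hot sentences for the prefix ending at that node, so a query is a single walk down the trie. The routing-by-prefix layer and persistence are omitted here.

```python
from typing import Dict, List

class TrieNode:
    __slots__ = ("children", "top_k")

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.top_k: Dict[str, int] = {}  # sentence -> frequency for this prefix

class PrefixTrie:
    def __init__(self, k: int = 10) -> None:
        self.root = TrieNode()
        self.k = k

    def insert(self, sentence: str, freq: int) -> None:
        node = self.root
        for ch in sentence:
            node = node.children.setdefault(ch, TrieNode())
            node.top_k[sentence] = freq
            # Keep only the k hottest sentences per prefix node.
            if len(node.top_k) > self.k:
                coldest = min(node.top_k, key=node.top_k.get)
                del node.top_k[coldest]

    def query(self, prefix: str) -> List[str]:
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return sorted(node.top_k, key=node.top_k.get, reverse=True)
```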