Monday, December 31, 2018

[System Design] Web Crawler

Scenario: Design a web crawler that crawls web pages.

How many webpages to crawl?
Google indexes 30 trillion web pages.

How long to crawl all the data?
Let's say 10 days, which means we need to crawl 30 * 10^12 / (10 * 86400) ≈ 35M webpages per second.

How large is the data?
Assume each webpage's size is 10K: 35M * 10K = 350G of data per second.
For all the webpages, we need 30 * 10^12 * 10K = 300P of space to store all that information.
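As a quick sanity check, these estimates can be reproduced with a few lines (treating 10K as 10^4 bytes):

```python
total_pages = 30 * 10**12                    # 30 trillion webpages
seconds = 10 * 86400                         # crawl window: 10 days
pages_per_second = total_pages / seconds     # ~3.5e7, i.e. ~35M pages per second
page_size = 10 * 1000                        # 10K per page
bandwidth = pages_per_second * page_size     # ~3.5e11 bytes, i.e. ~350G per second
total_storage = total_pages * page_size      # ~3e17 bytes, i.e. ~300P in total
print(pages_per_second, bandwidth, total_storage)
```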

How do we implement web crawler on a single machine?

This problem is basically a graph BFS problem (why not DFS? on a graph as huge as the web, the recursion depth would certainly overflow the stack). A really simple implementation could look something like this:
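Here is a minimal single-threaded sketch, assuming fetch_page(url) and extract_links(html) are placeholders for the real download and parsing logic:

```python
import collections

def crawl(seed_url, fetch_page, extract_links, max_pages=1000):
    """Single-threaded BFS crawl starting from seed_url."""
    queue = collections.deque([seed_url])
    seen = {seed_url}
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch_page(url)        # network IO blocks the only thread
        results[url] = html           # "store" the crawled page in memory
        for link in extract_links(html):
            if link not in seen:      # avoid crawling the same page twice
                seen.add(link)
                queue.append(link)
    return results
```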


But of course there are many problems with this design:

  1. Too slow! Network IO is expensive, and a single thread just blocks on it, so we waste CPU time
  2. If we want to store results on disk or in a DB, that IO is also expensive, same issue as above
  3. We do not persist the data, which means we cannot stop the work and resume later
  4. Spider trap?
  5. Politeness
  6. ....
Let's focus on the speed issue right now: a single thread is definitely too slow and CPU time is wasted. We definitely need to use multithreading, but we also need to make sure all threads work correctly without affecting each other, so there are race conditions we need to solve:
  • The task queue needs to be thread safe, so that multiple producers and consumers can add and remove data safely. For more information, please check this article related to the producer consumer queue.
  • We also need a DB to store the crawled webpages and record whether a link has been crawled before. It also needs to be thread safe.
So we can design a multi-threaded web crawler with the following modules (a sketch follows the list):
  • Crawler module: many threads keep getting tasks from the queue, crawl the webpages, then add new tasks back to the queue
  • Task queue: essentially a producer consumer queue; it needs to be thread safe, and we also need to persist those tasks
  • DB to store the crawled results
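A rough sketch of how these modules fit together, using Python's thread-safe queue.Queue as the task queue and a lock-protected in-memory set standing in for the "crawled before" check (a real system would persist both; fetch_page, extract_links and save_result are assumed placeholders):

```python
import queue
import threading

NUM_WORKERS = 8

task_queue = queue.Queue()      # thread-safe producer/consumer queue
seen = set()                    # stand-in for the "crawled before" check in the DB
seen_lock = threading.Lock()

def worker(fetch_page, extract_links, save_result):
    while True:
        url = task_queue.get()                 # blocks until a task is available
        try:
            html = fetch_page(url)
            save_result(url, html)             # e.g. write to the storage DB
            for link in extract_links(html):
                with seen_lock:
                    if link in seen:
                        continue
                    seen.add(link)
                task_queue.put(link)           # produce a new task
        finally:
            task_queue.task_done()

def start(seed_urls, fetch_page, extract_links, save_result):
    for url in seed_urls:
        with seen_lock:
            seen.add(url)
        task_queue.put(url)
    for _ in range(NUM_WORKERS):
        threading.Thread(target=worker,
                         args=(fetch_page, extract_links, save_result),
                         daemon=True).start()
    task_queue.join()                          # wait until the queue drains
```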
Service:

More generally, based on what we discussed above, the system can be split into different services:
  • Crawler Service, main business logic
  • Task Service, store and schedule tasks to be executed
  • Storage Service, store crawled data
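To make the split concrete, here is one possible set of interfaces; the class and method names are assumptions for illustration, not a fixed API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    task_id: int
    url: str
    priority: int = 0

class TaskService:
    """Stores tasks and decides which URL should be crawled next."""
    def add_task(self, url: str, priority: int = 0) -> Task: ...
    def next_task(self) -> Optional[Task]: ...
    def mark_done(self, task_id: int) -> None: ...

class StorageService:
    """Persists crawled pages."""
    def save(self, url: str, content: bytes) -> None: ...

class CrawlerService:
    """Main business logic: pull a task, fetch the page, store it, enqueue new links."""
    def __init__(self, tasks: TaskService, storage: StorageService):
        self.tasks = tasks
        self.storage = storage
```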

Storage:

How do we store the task and results?

We can have a task table like below:


Field             Type        Detail
task_id                       primary key
url               string
priority          int
status            int         working/idle
available_time    timestamp   next time we should crawl this page

A SQL database is better since we may need to index and sort by status/priority/available_time.
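One way the task table and its index could look, sketched with sqlite3 just to make the schema concrete (the exact column types are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE task (
    task_id        INTEGER PRIMARY KEY,
    url            TEXT NOT NULL,
    priority       INTEGER NOT NULL DEFAULT 0,
    status         INTEGER NOT NULL DEFAULT 0,   -- 0 = idle, 1 = working
    available_time TIMESTAMP NOT NULL
);
-- index supporting "next idle, highest-priority task that is due"
CREATE INDEX idx_task_pick ON task (status, priority, available_time);
""")

# Pick the next task to dispatch
row = conn.execute("""
    SELECT task_id, url FROM task
    WHERE status = 0 AND available_time <= CURRENT_TIMESTAMP
    ORDER BY priority DESC, available_time
    LIMIT 1
""").fetchone()
```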

And the crawled result will just be a <url, content> pair, which we store in a NoSQL database:
  • row_key = url
  • column_key = "content" (other column keys can also be added, like anchors, links, title, etc.)
  • value = actual content
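A toy stand-in for this wide-column layout (a real system would use a Bigtable/HBase-style client library instead of an in-memory dict):

```python
page_store = {}   # row_key -> {column_key: value}

def save_page(url, content, links=None, title=None):
    row = page_store.setdefault(url, {})   # row_key = url
    row["content"] = content               # column_key = "content"
    if links is not None:
        row["links"] = links               # optional extra column keys
    if title is not None:
        row["title"] = title

def get_content(url):
    return page_store.get(url, {}).get("content")
```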

Scale:

The task service definitely needs to be scaled if we want to support huge QPS; otherwise the selection process will be super slow. The task table can be sharded by task_id, and we will also need a scheduler/message queue to fetch and dispatch those tasks. We won't go deep into this part since it is a different topic; you can refer to RabbitMQ/Kafka for more details. The storage service will be a distributed NoSQL database, like Bigtable. Different crawler machines will keep getting tasks from the task queue and crawling pages.
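For example, sharding the task table and routing tasks by task_id might look like the sketch below; the shard count and the queue objects are assumptions for illustration:

```python
NUM_SHARDS = 64

def shard_for(task_id: int) -> int:
    # Simple modulo sharding of the task table by task_id; consistent hashing
    # would avoid a full reshuffle when shards are added.
    return task_id % NUM_SHARDS

def dispatch(task, shard_queues):
    # shard_queues[i] is the message queue (e.g. a RabbitMQ/Kafka topic)
    # feeding the crawler machines responsible for shard i.
    shard_queues[shard_for(task.task_id)].put(task)
```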

