Monday, December 31, 2018

[System Design] Web Crawler

Scenario: Design a web crawler that crawls web pages.

How many webpages to crawl?
Google indexes 30 trillion web pages.

How long to crawl all the data?
Let's say 10 days, which means we need to crawl 30 * 10^12 / (10 * 86400) ≈ 35M webpages per second.

How large is the data?
Assume each webpage's size is 10K: 35M * 10K = 350G of data per second.
For all the webpages, we need 30 * 10^12 * 10K = 300P of space to store all that information.
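As a quick sanity check, these estimates can be reproduced with a few lines (treating 10K as 10^4 bytes):

```python
total_pages = 30 * 10**12                    # 30 trillion webpages
seconds = 10 * 86400                         # crawl window: 10 days
pages_per_second = total_pages / seconds     # ~3.5e7, i.e. ~35M pages per second
page_size = 10 * 1000                        # 10K per page
bandwidth = pages_per_second * page_size     # ~3.5e11 bytes, i.e. ~350G per second
total_storage = total_pages * page_size      # ~3e17 bytes, i.e. ~300P in total
print(pages_per_second, bandwidth, total_storage)
```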

How do we implement web crawler on a single machine?

This problem is basically a graph BFS problem (why not DFS? on a graph as huge as the web, the recursion depth would certainly overflow the stack). A really simple implementation could look something like this:
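Here is a minimal single-threaded sketch, assuming fetch_page(url) and extract_links(html) are placeholders for the real download and parsing logic:

```python
import collections

def crawl(seed_url, fetch_page, extract_links, max_pages=1000):
    """Single-threaded BFS crawl starting from seed_url."""
    queue = collections.deque([seed_url])
    seen = {seed_url}
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch_page(url)        # network IO blocks the only thread
        results[url] = html           # "store" the crawled page in memory
        for link in extract_links(html):
            if link not in seen:      # avoid crawling the same page twice
                seen.add(link)
                queue.append(link)
    return results
```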


But of course there are many problems with this design:

  1. Too slow! Network IO is expensive, and a single thread just blocks on it, so we waste CPU time
  2. If we want to store results on disk or in a DB, that IO is also expensive, same issue as above
  3. We do not persist the data, which means we cannot stop the work and resume later
  4. Spider trap?
  5. Politeness
  6. ....
Let's focus on the speed issue right now: a single thread is definitely too slow and CPU time is wasted. We definitely need to use multithreading, but we also need to make sure all threads work correctly without affecting each other, so there are race conditions we need to solve:
  • The task queue needs to be thread safe, so that multiple producers and consumers can add and remove data safely. For more information, please check this article related to the producer consumer queue.
  • We also need a DB to store the crawled webpages and record whether a link has been crawled before. It also needs to be thread safe.
So we can design a multi-threaded web crawler with the following modules (a sketch follows the list):
  • Crawler module: many threads keep getting tasks from the queue, crawl the webpages, then add new tasks back to the queue
  • Task queue: essentially a producer consumer queue; it needs to be thread safe, and we also need to persist those tasks
  • DB to store the crawled results
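A rough sketch of how these modules fit together, using Python's thread-safe queue.Queue as the task queue and a lock-protected in-memory set standing in for the "crawled before" check (a real system would persist both; fetch_page, extract_links and save_result are assumed placeholders):

```python
import queue
import threading

NUM_WORKERS = 8

task_queue = queue.Queue()      # thread-safe producer/consumer queue
seen = set()                    # stand-in for the "crawled before" check in the DB
seen_lock = threading.Lock()

def worker(fetch_page, extract_links, save_result):
    while True:
        url = task_queue.get()                 # blocks until a task is available
        try:
            html = fetch_page(url)
            save_result(url, html)             # e.g. write to the storage DB
            for link in extract_links(html):
                with seen_lock:
                    if link in seen:
                        continue
                    seen.add(link)
                task_queue.put(link)           # produce a new task
        finally:
            task_queue.task_done()

def start(seed_urls, fetch_page, extract_links, save_result):
    for url in seed_urls:
        with seen_lock:
            seen.add(url)
        task_queue.put(url)
    for _ in range(NUM_WORKERS):
        threading.Thread(target=worker,
                         args=(fetch_page, extract_links, save_result),
                         daemon=True).start()
    task_queue.join()                          # wait until the queue drains
```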
Service:

More generally, based on what we discussed above, the system can be split into different services:
  • Crawler Service, main business logic
  • Task Service, store and schedule tasks to be executed
  • Storage Service, store crawled data
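To make the split concrete, here is one possible set of interfaces; the class and method names are assumptions for illustration, not a fixed API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    task_id: int
    url: str
    priority: int = 0

class TaskService:
    """Stores tasks and decides which URL should be crawled next."""
    def add_task(self, url: str, priority: int = 0) -> Task: ...
    def next_task(self) -> Optional[Task]: ...
    def mark_done(self, task_id: int) -> None: ...

class StorageService:
    """Persists crawled pages."""
    def save(self, url: str, content: bytes) -> None: ...

class CrawlerService:
    """Main business logic: pull a task, fetch the page, store it, enqueue new links."""
    def __init__(self, tasks: TaskService, storage: StorageService):
        self.tasks = tasks
        self.storage = storage
```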

Storage:

How do we store the task and results?

We can have a task table like below:


Field             Type        Detail
task_id                       primary key
url               string
priority          int
status            int         working/idle
available_time    timestamp   next time we should crawl this page

A SQL database is better since we may need to index and sort by status/priority/available_time.
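One way the task table and its index could look, sketched with sqlite3 just to make the schema concrete (the exact column types are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE task (
    task_id        INTEGER PRIMARY KEY,
    url            TEXT NOT NULL,
    priority       INTEGER NOT NULL DEFAULT 0,
    status         INTEGER NOT NULL DEFAULT 0,   -- 0 = idle, 1 = working
    available_time TIMESTAMP NOT NULL
);
-- index supporting "next idle, highest-priority task that is due"
CREATE INDEX idx_task_pick ON task (status, priority, available_time);
""")

# Pick the next task to dispatch
row = conn.execute("""
    SELECT task_id, url FROM task
    WHERE status = 0 AND available_time <= CURRENT_TIMESTAMP
    ORDER BY priority DESC, available_time
    LIMIT 1
""").fetchone()
```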

And the crawled result will just be a <url, content> pair, which we store in a NoSQL database:
  • row_key = url
  • column_key = "content" (other column keys can also be added, like anchors, links, title, etc.)
  • value = actual content
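A toy stand-in for this wide-column layout (a real system would use a Bigtable/HBase-style client library instead of an in-memory dict):

```python
page_store = {}   # row_key -> {column_key: value}

def save_page(url, content, links=None, title=None):
    row = page_store.setdefault(url, {})   # row_key = url
    row["content"] = content               # column_key = "content"
    if links is not None:
        row["links"] = links               # optional extra column keys
    if title is not None:
        row["title"] = title

def get_content(url):
    return page_store.get(url, {}).get("content")
```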

Scale:

The task service definitely needs to be scaled if we want to support huge QPS; otherwise the selection process will be super slow. The task table can be sharded by task_id, and we will also need a scheduler/message queue to fetch and dispatch those tasks. We won't go deep into this part since it is a different topic; you can refer to RabbitMQ/Kafka for more details. The storage service will be a distributed NoSQL database, like Bigtable. Different crawler machines will keep getting tasks from the task queue and crawling pages.
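For example, sharding the task table and routing tasks by task_id might look like the sketch below; the shard count and the queue objects are assumptions for illustration:

```python
NUM_SHARDS = 64

def shard_for(task_id: int) -> int:
    # Simple modulo sharding of the task table by task_id; consistent hashing
    # would avoid a full reshuffle when shards are added.
    return task_id % NUM_SHARDS

def dispatch(task, shard_queues):
    # shard_queues[i] is the message queue (e.g. a RabbitMQ/Kafka topic)
    # feeding the crawler machines responsible for shard i.
    shard_queues[shard_for(task.task_id)].put(task)
```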

