Problem: Design a location-based service like Uber.
Scenario:
A rider should be able to request a ride, and an available driver should be matched to the request
We should be able to keep track of drivers' locations: drivers report their location every 4 seconds (based on this article)
Services:
Geo Service: stores drivers' location information and handles queries like "give me the drivers within 2 km of this location". Since drivers report their location every 4 seconds, it will be write heavy. Assuming 0.2M active drivers per day, the write QPS will be 200,000 / 4 = 50K, and peak QPS might be 50K * 3 = 150K, so writes need to be efficient
Dispatch Service: handles riders' requests, matches a driver, and keeps track of trip information
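The write-load estimate above as a quick back-of-envelope check (the 3x peak factor is an assumption from the text):

```python
# Back-of-envelope write load for the Geo service.
active_drivers = 200_000      # 0.2M active drivers per day
report_interval_s = 4         # each driver reports its location every 4 seconds
avg_qps = active_drivers / report_interval_s
peak_qps = avg_qps * 3        # assume peak traffic is 3x the average

print(avg_qps, peak_qps)      # 50000.0 150000.0
```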
Storage:
GeoService:
Driver's location, schema:
Field | Type | Detail |
---|---|---|
driver_id | primary key | user id |
longitude | float | |
latitude | float | |
status | int | if available |
created_at | timestamp | |
Dispatch Service:
Trip information, schema:
Field | Type | Detail |
---|---|---|
trip_id | primary key | |
rider_id | foreign key | user id |
driver_id | foreign key | user id |
status | int | waiting/driver on the way/in trip/ended/cancelled |
longitude | float | |
latitude | float | |
created_at | timestamp | |
Of course we will need other tables to store rider and driver information, like their personal details, car information, and status, but they are not the crucial part, so we won't discuss them in detail.
Both databases will be SQL-based. For the location table, we need to index by latitude and longitude. For the trip table, we will also need to index by trip_id, rider_id, and driver_id.
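As a runnable sketch, the two schemas above in SQLite (table and index names are ours; the `status` integer encodings are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE driver_location (
    driver_id  INTEGER PRIMARY KEY,   -- user id
    longitude  REAL,
    latitude   REAL,
    status     INTEGER,               -- whether the driver is available
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_driver_lat_long ON driver_location (latitude, longitude);

CREATE TABLE trip (
    trip_id    INTEGER PRIMARY KEY,
    rider_id   INTEGER,               -- foreign key into the riders table
    driver_id  INTEGER,               -- foreign key into the drivers table
    status     INTEGER,               -- waiting / driver on the way / in trip / ended / cancelled
    longitude  REAL,
    latitude   REAL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_trip_rider  ON trip (rider_id);
CREATE INDEX idx_trip_driver ON trip (driver_id);
""")
```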
Naive workflow:
- Drivers report their location to the Geo service every 4 seconds, and their location information is updated
- The rider requests a taxi; a new entry is created in the trip table and the trip id is returned
- On the backend, the dispatch service sends a query to the Geo service with the rider's current location
- A list of available drivers is returned, and a matching algorithm is run to find a match
- The trip entry is updated based on the match found
- Both the rider and the driver keep polling the dispatch service until the match result is returned
- Since drivers periodically talk to both the Geo service and the dispatch service, we can combine that information into a single request to the dispatch service, which then forwards the location update to the Geo service asynchronously
- Step 3 will be very inefficient: to query by a distance delta, we need 2 range predicates in a SQL database (currlong - delta <= longitude <= currlong + delta, currlat - delta <= latitude <= currlat + delta), and a B-tree index can only serve one of them efficiently
- Step 1 will also be inefficient, since we are updating an indexed SQL table very frequently
- If we don't store raw latitude and longitude, what should we store instead?
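To make the step-3 problem concrete, here is the naive bounding-box query with the two range predicates, as a runnable SQLite sketch (the km-to-degrees conversion is approximate and all coordinates are made up):

```python
import sqlite3

DELTA = 2 / 111.0  # ~2 km in degrees of latitude (longitude shrinks with cos(lat); ignored here)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE driver_location "
             "(driver_id INTEGER PRIMARY KEY, longitude REAL, latitude REAL, status INTEGER)")
conn.execute("CREATE INDEX idx_loc ON driver_location (latitude, longitude)")
# two drivers near the rider, one far away
conn.executemany("INSERT INTO driver_location VALUES (?, ?, ?, 1)",
                 [(1, -0.118, 51.509), (2, -0.120, 51.515), (3, 2.352, 48.857)])

lat, lon = 51.51, -0.12  # rider's current location
rows = conn.execute(
    """SELECT driver_id FROM driver_location
       WHERE status = 1
         AND latitude  BETWEEN ? AND ?
         AND longitude BETWEEN ? AND ?""",
    (lat - DELTA, lat + DELTA, lon - DELTA, lon + DELTA)).fetchall()
# Only one of the two range predicates can use the index;
# the other is filtered over all candidate rows.
```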
How to store GeoIndex:
There are many solutions, but we will just talk about these two:
- Google S2
  - Divides the earth into tiny cells using the Google S2 library. Each cell has a unique 64-bit cell ID
  - Using an int64, every square centimeter on earth can be represented. Uber uses level-12 cells, which are 3.31 km² to 6.38 km² in area, depending on where on earth you are. The cells change shape and size depending on where on the sphere they are
  - S2 can give the coverage for a shape: if you want to draw a circle with a 1 km radius centered on London, S2 can tell you which cells are needed to completely cover the shape
- GeoHash
  - A geohash identifies a rectangular cell: at each level, each extra character identifies one of 32 sub-cells
  - Recursively divide the region until we reach the granularity we want
  - A base-32 string (0-9 and a-z without a, i, l, o); the longer the string, the more precise the cell
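The base-32 encoding described above can be sketched in a few lines of pure Python (standard geohash interleaves one bit per split, longitude first; `precision` is the number of output characters):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # 0-9 a-z without a, i, l, o

def geohash_encode(lat, lon, precision=6):
    """Recursively halve the lat/lon ranges, emitting one bit per split
    (longitude first) and one base-32 character per 5 bits."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, ch, even = [], 0, 0, True
    while len(chars) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1   # upper half
            rng[0] = mid
        else:
            ch <<= 1             # lower half
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:
            chars.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(chars)
```

A longer string names a smaller cell, and every shorter prefix names the enclosing larger cell, which is exactly what makes prefix queries work.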
By using this, we can change the location table to something like this:
Field | Type | Detail |
---|---|---|
geohash/cell_id | primary key | represents a small region on earth |
drivers | set of driver ids | {driver1, driver2, ...} |
This becomes key-value storage, so we can use a NoSQL database here to improve read/write speed. We can also store cells/geohash prefixes at different levels to represent regions of different sizes, and query smaller regions first, then larger ones.
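A minimal in-memory sketch of this key-value layout (a dict standing in for the NoSQL store; precision 5 gives roughly 5 km x 5 km cells, and the geohash strings below are example values):

```python
from collections import defaultdict

PRECISION = 5  # ~5 km x 5 km cells; shorter prefixes cover larger regions

class GeoIndex:
    """Key-value geo index: cell id (geohash prefix) -> set of driver ids."""

    def __init__(self):
        self.cell_to_drivers = defaultdict(set)
        self.driver_to_cell = {}

    def report(self, driver_id, geohash):
        """O(1) write: move the driver from its old cell to the new one."""
        cell = geohash[:PRECISION]
        old = self.driver_to_cell.get(driver_id)
        if old is not None and old != cell:
            self.cell_to_drivers[old].discard(driver_id)
        self.cell_to_drivers[cell].add(driver_id)
        self.driver_to_cell[driver_id] = cell

    def nearby(self, geohash):
        """O(1) read: all drivers whose geohash shares the rider's prefix."""
        return set(self.cell_to_drivers[geohash[:PRECISION]])

index = GeoIndex()
index.report(1, "u4pruyd")
index.report(2, "u4pruyk")
index.report(3, "ezs42ab")   # far away
index.nearby("u4pruxx")      # -> {1, 2}
```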
Optimized workflow:
- Drivers report their location to the Geo service every 4 seconds, and their location information is updated
- The rider requests a taxi; a new entry is created in the trip table and the trip id is returned
- On the backend, the dispatch service sends a query to the Geo service with the rider's current location
- Geohashes with the matching prefix / cell IDs within the nearby x km are queried, and a list of available drivers is returned
- A matching algorithm is run to find a match
- The trip entry is updated based on the match found
- Both the rider and the driver keep polling the dispatch service until the match result is returned
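The matching step is left open above; a minimal nearest-available-driver sketch (haversine great-circle distance; a real dispatch system would also weigh ETA, traffic, and driver state):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))   # mean earth radius ~6371 km

def match(rider, drivers):
    """Pick the closest available driver.
    `rider` is (lat, lon); `drivers` is {driver_id: (lat, lon)}."""
    if not drivers:
        return None
    return min(drivers, key=lambda d: haversine_km(*rider, *drivers[d]))
```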
Scale:
To support high QPS, scaling is definitely needed. For the storage nodes, the location table can be sharded by geohash/cell id, and the trip table can be sharded by trip id. Multiple front-end servers are also needed for load balancing. We can also use a cache to speed up reads, and database replicas can split the read traffic.
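A sketch of the shard routing just described (MD5-based key hashing; the shard count is an assumption). Note that with plain modulo hashing, changing the shard count remaps most keys, which is what the consistent hashing discussed next avoids:

```python
import hashlib

NUM_SHARDS = 16  # assumed cluster size

def shard_for(key: str) -> int:
    """Route a key (geohash/cell id for the location table,
    trip id for the trip table) to a stable shard."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS
```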
Uber uses ringpop, a decentralized, gossip-based system built around a consistent hash ring, much like the design of Dynamo we talked about before. In this way the server nodes and the web service can live on the same machine, but the server nodes are no longer stateless, so how do we do load balancing? Any node can take an incoming request; if it is not the node supposed to handle that request, it forwards it to the correct node based on the ring membership. An extra hop may be needed, but the system is easier to maintain and scale. TChannel-based RPC is used when nodes talk to each other.
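The ring-plus-forwarding idea can be sketched as follows (a toy model in the spirit of ringpop, not its actual implementation; node names and the virtual-node count are made up):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Consistent hash ring: every node can take any request and forwards
    it to the owner, so at most one extra hop is needed."""

    def __init__(self, nodes, vnodes=100):
        self._points = []
        for node in nodes:
            for i in range(vnodes):   # virtual nodes smooth the key distribution
                self._points.append((_hash(f"{node}#{i}"), node))
        self._points.sort()

    def owner(self, key: str) -> str:
        """First node clockwise from the key's position on the ring."""
        idx = bisect.bisect(self._points, (_hash(key), ""))
        return self._points[idx % len(self._points)][1]

def handle(ring, this_node, key):
    owner = ring.owner(key)
    if owner == this_node:
        return f"handled {key} on {this_node}"
    return f"forwarded {key} from {this_node} to {owner}"  # the extra hop
```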
GeoFence:
Most of Uber's services are city based; how do we tell users whether the desired service is available at their location?
A geofence refers to a human-defined geographic area (a polygon, in geometry terms) on the Earth's surface.
All of this geofence information is stored in a database. But how do we query it? The brute-force way is simple: go through all the geofences and run a point-in-polygon check for each, but that is too slow.
Geofences can be organized into a two-level hierarchy where the first level is the city geofences (geofences defining city boundaries), and the second level is the geofences within each city. For each lookup, we first find the desired city with a linear scan of all the city geofences, and then find the containing geofences within that city with another linear scan. For more information, please check this article.
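The two-level lookup above can be sketched with a standard ray-casting point-in-polygon check (the city and fence data here are toy squares for illustration):

```python
def point_in_polygon(x, y, poly):
    """Ray casting: a point is inside iff a horizontal ray from it crosses
    the polygon boundary an odd number of times. `poly` is (x, y) vertices."""
    inside = False
    j = len(poly) - 1
    for i in range(len(poly)):
        xi, yi = poly[i]
        xj, yj = poly[j]
        crosses = (yi > y) != (yj > y)
        if crosses and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

def find_geofences(x, y, cities):
    """Two-level lookup: linear scan of the city fences, then of the fences
    inside the matching city. `cities`: name -> (city_polygon, {fence: polygon})."""
    for _city, (boundary, fences) in cities.items():
        if point_in_polygon(x, y, boundary):
            return [name for name, poly in fences.items()
                    if point_in_polygon(x, y, poly)]
    return []
```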
There are still many topics we could discuss, like Uber Pool, which might need another service to keep track of the status of each car and decide whether it can serve a given group of riders. But that is out of the scope of this article, so we will stop here.