Problem: Design a location-based service like Uber.
Scenario:
A rider should be able to request a ride, and an available driver should be matched to the request
We should be able to keep track of drivers' locations: drivers report their location every 4 seconds (based on this article)
Services:
Geo Service: stores drivers' location information and handles queries like "give me the drivers within 2 km of this location". Since drivers report their location every 4 seconds, it will be write heavy. Assuming 0.2M active drivers per day, the write QPS will be 200,000 / 4 = 50K, and peak QPS might be 50K * 3 = 150K, so writes need to be efficient
Dispatch Service: handles riders' requests, matches a driver, and keeps track of trip information
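The write-load estimate above as a quick back-of-envelope check (the 3x peak factor is an assumption from the text):

```python
# Back-of-envelope write load for the Geo service.
active_drivers = 200_000      # 0.2M active drivers per day
report_interval_s = 4         # each driver reports its location every 4 seconds
avg_qps = active_drivers / report_interval_s
peak_qps = avg_qps * 3        # assume peak traffic is 3x the average

print(avg_qps, peak_qps)      # 50000.0 150000.0
```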
Storage:
GeoService:
Driver's location, schema:
Field | Type | Detail |
---|---|---|
driver_id | primary key | user id |
longitude | float | |
latitude | float | |
status | int | if available |
created_at | timestamp | |
Dispatch Service:
Trip information, schema:
Field | Type | Detail |
---|---|---|
trip_id | primary key | |
rider_id | foreign key | user id |
driver_id | foreign key | user id |
status | int | waiting/driver on the way/in trip/ended/cancelled |
longitude | float | |
latitude | float | |
created_at | timestamp | |
Of course we will need other tables to store rider and driver information, like their personal details, car information, and status, but they are not the crucial part, so we won't discuss them in detail.
Both databases will be SQL-based. For the location table, we need to index by latitude and longitude. For the trip table, we will also need to index by trip_id, rider_id, and driver_id.
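As a runnable sketch, the two schemas above in SQLite (table and index names are ours; the `status` integer encodings are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE driver_location (
    driver_id  INTEGER PRIMARY KEY,   -- user id
    longitude  REAL,
    latitude   REAL,
    status     INTEGER,               -- whether the driver is available
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_driver_lat_long ON driver_location (latitude, longitude);

CREATE TABLE trip (
    trip_id    INTEGER PRIMARY KEY,
    rider_id   INTEGER,               -- foreign key into the riders table
    driver_id  INTEGER,               -- foreign key into the drivers table
    status     INTEGER,               -- waiting / driver on the way / in trip / ended / cancelled
    longitude  REAL,
    latitude   REAL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_trip_rider  ON trip (rider_id);
CREATE INDEX idx_trip_driver ON trip (driver_id);
""")
```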
Naive workflow:
- Drivers report their location to the Geo service every 4 seconds, and their location information is updated
- The rider requests a taxi; a new entry is created in the trip table and the trip id is returned
- On the backend, the dispatch service sends a query to the Geo service with the rider's current location
- A list of available drivers is returned, and a matching algorithm is run to find a match
- The trip entry is updated based on the match found
- Both the rider and the driver keep polling the dispatch service until the match result is returned
- Since drivers periodically talk to both the Geo service and the dispatch service, we can combine that information into a single request to the dispatch service, which then forwards the location update to the Geo service asynchronously
- Step 3 will be very inefficient: to query by a distance delta, we need 2 range predicates in a SQL database (currlong - delta <= longitude <= currlong + delta, currlat - delta <= latitude <= currlat + delta), and a B-tree index can only serve one of them efficiently
- Step 1 will also be inefficient, since we are updating an indexed SQL table very frequently
- If we don't store raw latitude and longitude, what should we store instead?
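To make the step-3 problem concrete, here is the naive bounding-box query with the two range predicates, as a runnable SQLite sketch (the km-to-degrees conversion is approximate and all coordinates are made up):

```python
import sqlite3

DELTA = 2 / 111.0  # ~2 km in degrees of latitude (longitude shrinks with cos(lat); ignored here)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE driver_location "
             "(driver_id INTEGER PRIMARY KEY, longitude REAL, latitude REAL, status INTEGER)")
conn.execute("CREATE INDEX idx_loc ON driver_location (latitude, longitude)")
# two drivers near the rider, one far away
conn.executemany("INSERT INTO driver_location VALUES (?, ?, ?, 1)",
                 [(1, -0.118, 51.509), (2, -0.120, 51.515), (3, 2.352, 48.857)])

lat, lon = 51.51, -0.12  # rider's current location
rows = conn.execute(
    """SELECT driver_id FROM driver_location
       WHERE status = 1
         AND latitude  BETWEEN ? AND ?
         AND longitude BETWEEN ? AND ?""",
    (lat - DELTA, lat + DELTA, lon - DELTA, lon + DELTA)).fetchall()
# Only one of the two range predicates can use the index;
# the other is filtered over all candidate rows.
```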
How to store GeoIndex:
There are many solutions, but we will just talk about these two:
- Google S2
  - Divides the earth into tiny cells using the Google S2 library. Each cell has a unique 64-bit cell ID
  - Using an int64, every square centimeter on earth can be represented. Uber uses level-12 cells, which are 3.31 km² to 6.38 km² in area, depending on where on earth you are. The cells change shape and size depending on where on the sphere they are
  - S2 can give the coverage for a shape: if you want to draw a circle with a 1 km radius centered on London, S2 can tell you which cells are needed to completely cover the shape
- GeoHash
  - A geohash identifies a rectangular cell: at each level, each extra character identifies one of 32 sub-cells
  - Recursively divide the region until we reach the granularity we want
  - A base-32 string (0-9 and a-z without a, i, l, o); the longer the string, the more precise the cell
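The base-32 encoding described above can be sketched in a few lines of pure Python (standard geohash interleaves one bit per split, longitude first; `precision` is the number of output characters):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # 0-9 a-z without a, i, l, o

def geohash_encode(lat, lon, precision=6):
    """Recursively halve the lat/lon ranges, emitting one bit per split
    (longitude first) and one base-32 character per 5 bits."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, ch, even = [], 0, 0, True
    while len(chars) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1   # upper half
            rng[0] = mid
        else:
            ch <<= 1             # lower half
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:
            chars.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(chars)
```

A longer string names a smaller cell, and every shorter prefix names the enclosing larger cell, which is exactly what makes prefix queries work.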
By using this, we can change the location table to something like this:
Field | Type | Detail |
---|---|---|
geohash/cell_id | primary key | represents a small region on earth |
drivers | set of driver ids | {driver1, driver2, ...} |
This becomes key-value storage, so we can use a NoSQL database here to improve read/write speed. We can also store cells/geohash prefixes at different levels to represent regions of different sizes, and query smaller regions first, then larger ones.
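A minimal in-memory sketch of this key-value layout (a dict standing in for the NoSQL store; precision 5 gives roughly 5 km x 5 km cells, and the geohash strings below are example values):

```python
from collections import defaultdict

PRECISION = 5  # ~5 km x 5 km cells; shorter prefixes cover larger regions

class GeoIndex:
    """Key-value geo index: cell id (geohash prefix) -> set of driver ids."""

    def __init__(self):
        self.cell_to_drivers = defaultdict(set)
        self.driver_to_cell = {}

    def report(self, driver_id, geohash):
        """O(1) write: move the driver from its old cell to the new one."""
        cell = geohash[:PRECISION]
        old = self.driver_to_cell.get(driver_id)
        if old is not None and old != cell:
            self.cell_to_drivers[old].discard(driver_id)
        self.cell_to_drivers[cell].add(driver_id)
        self.driver_to_cell[driver_id] = cell

    def nearby(self, geohash):
        """O(1) read: all drivers whose geohash shares the rider's prefix."""
        return set(self.cell_to_drivers[geohash[:PRECISION]])

index = GeoIndex()
index.report(1, "u4pruyd")
index.report(2, "u4pruyk")
index.report(3, "ezs42ab")   # far away
index.nearby("u4pruxx")      # -> {1, 2}
```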
Optimized workflow:
- Drivers report their location to the Geo service every 4 seconds, and their location information is updated
- The rider requests a taxi; a new entry is created in the trip table and the trip id is returned
- On the backend, the dispatch service sends a query to the Geo service with the rider's current location
- Geohashes with the matching prefix / cell IDs within the nearby x km are queried, and a list of available drivers is returned
- A matching algorithm is run to find a match
- The trip entry is updated based on the match found
- Both the rider and the driver keep polling the dispatch service until the match result is returned
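The matching step is left open above; a minimal nearest-available-driver sketch (haversine great-circle distance; a real dispatch system would also weigh ETA, traffic, and driver state):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))   # mean earth radius ~6371 km

def match(rider, drivers):
    """Pick the closest available driver.
    `rider` is (lat, lon); `drivers` is {driver_id: (lat, lon)}."""
    if not drivers:
        return None
    return min(drivers, key=lambda d: haversine_km(*rider, *drivers[d]))
```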
Scale:
To support high QPS, scaling is definitely needed. For the storage nodes, the location table can be sharded by geohash/cell id, and the trip table can be sharded by trip id. Multiple front-end servers are also needed for load balancing. We can also use a cache to speed up reads, and database replicas can split the read traffic.
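A sketch of the shard routing just described (MD5-based key hashing; the shard count is an assumption). Note that with plain modulo hashing, changing the shard count remaps most keys, which is what the consistent hashing discussed next avoids:

```python
import hashlib

NUM_SHARDS = 16  # assumed cluster size

def shard_for(key: str) -> int:
    """Route a key (geohash/cell id for the location table,
    trip id for the trip table) to a stable shard."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS
```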
Uber uses ringpop, a decentralized, gossip-based system built around a consistent hash ring, much like the design of Dynamo we talked about before. In this way the server nodes and the web service can live on the same machine, but the server nodes are no longer stateless, so how do we do load balancing? Any node can take an incoming request; if it is not the node supposed to handle that request, it forwards it to the correct node based on the ring membership. An extra hop may be needed, but the system is easier to maintain and scale. TChannel-based RPC is used when nodes talk to each other.
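The ring-plus-forwarding idea can be sketched as follows (a toy model in the spirit of ringpop, not its actual implementation; node names and the virtual-node count are made up):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Consistent hash ring: every node can take any request and forwards
    it to the owner, so at most one extra hop is needed."""

    def __init__(self, nodes, vnodes=100):
        self._points = []
        for node in nodes:
            for i in range(vnodes):   # virtual nodes smooth the key distribution
                self._points.append((_hash(f"{node}#{i}"), node))
        self._points.sort()

    def owner(self, key: str) -> str:
        """First node clockwise from the key's position on the ring."""
        idx = bisect.bisect(self._points, (_hash(key), ""))
        return self._points[idx % len(self._points)][1]

def handle(ring, this_node, key):
    owner = ring.owner(key)
    if owner == this_node:
        return f"handled {key} on {this_node}"
    return f"forwarded {key} from {this_node} to {owner}"  # the extra hop
```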
GeoFence:
Most of Uber's services are city based; how do we tell users whether the desired service is available at their location?
A geofence refers to a human-defined geographic area (a polygon, in geometry terms) on the Earth's surface.
All of this geofence information is stored in a database. But how do we query it? The brute-force way is simple: go through all the geofences and run a point-in-polygon check for each, but that is too slow.
Geofences can be organized into a two-level hierarchy where the first level is the city geofences (geofences defining city boundaries), and the second level is the geofences within each city. For each lookup, we first find the desired city with a linear scan of all the city geofences, and then find the containing geofences within that city with another linear scan. For more information, please check this article.
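The two-level lookup above can be sketched with a standard ray-casting point-in-polygon check (the city and fence data here are toy squares for illustration):

```python
def point_in_polygon(x, y, poly):
    """Ray casting: a point is inside iff a horizontal ray from it crosses
    the polygon boundary an odd number of times. `poly` is (x, y) vertices."""
    inside = False
    j = len(poly) - 1
    for i in range(len(poly)):
        xi, yi = poly[i]
        xj, yj = poly[j]
        crosses = (yi > y) != (yj > y)
        if crosses and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

def find_geofences(x, y, cities):
    """Two-level lookup: linear scan of the city fences, then of the fences
    inside the matching city. `cities`: name -> (city_polygon, {fence: polygon})."""
    for _city, (boundary, fences) in cities.items():
        if point_in_polygon(x, y, boundary):
            return [name for name, poly in fences.items()
                    if point_in_polygon(x, y, poly)]
    return []
```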
There are still many topics we could discuss, like Uber Pool, which might need another service to keep track of the status of each car and decide whether it can serve a given group of riders. But that is out of the scope of this article, so we will stop here.