DoorDash's Data Engineering
A glimpse of how the Engineering team at DoorDash functions behind the scenes.
Introduction
DoorDash has over 30 million users worldwide, and its entire engineering and operations are handled by about 17,000 employees. DoorDash reported revenue of about $7 billion USD in 2022, a 34% increase over 2021 [more such statistics reported here]. Their backend team of over 2,000 engineers handles all technical operations for this huge user base, serving millions of requests every day.
The Rise of a Successful Startup
In 2019, DoorDash completed 263 million orders. In 2020, they completed 816 million. Extrapolating, some reports estimate that DoorDash could have served about 1.5 to 2 billion orders in 2022, an average of 4-5 million per day. This reflects rapid growth in both application usage and user base.
Companies like DoorDash found perfect strategies to seize the market in favorable conditions (during the pandemic and the era of remote work). Tom Alder mentions these business strategies in a LinkedIn post.
DoorDash has strong brand recognition and a loyal customer base. Early on, the company targeted popular restaurants that lacked delivery services, instead of competing over newer restaurants already served by multiple delivery providers.
The startup established a symbiotic relationship with partnered restaurants, by providing data that can help the restaurants (like popular dishes in the area, customer demographics, delivery times, etc.).
This YC-backed company focused on outer suburbs with larger family homes, a region with much higher average order values than cities but fewer modes of transport.
During the pandemic, they supported their delivery agents and restaurants with timely payments and other benefits, eventually establishing a stronger connection with their partners.
DoorDash’s Infrastructure: A closer look
Let’s dive deeper to see the databases that power DoorDash.
CockroachDB
CockroachDB is a relatively new open-source distributed SQL database that DoorDash uses for data that must be highly available and scalable, such as customer, order, restaurant, and Dasher (DoorDash's term for driver) data. It also helps the engineering teams run analytics and make predictions. CockroachDB is distributed, fault-tolerant, and designed to handle high volumes of data. It scales horizontally, meaning capacity grows by adding more nodes to accommodate higher load. This makes it a good fit for DoorDash: the database can scale with the company's evolving needs while staying highly available. We will revisit why DoorDash switched to CockroachDB later in this article.
Redis
Redis is a key-value store, where data is stored as key-value pairs, similar to a dictionary in Python. This type of database is mostly used where small amounts of data need to be accessed quickly, making it a good choice for caching and real-time applications that demand high performance and low latency. The trade-off is durability: Redis holds data in memory and acknowledges writes before replicating them, so a failing node can lose recently written data. DoorDash uses Redis to store real-time data and to cache session data and restaurant menus.
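A minimal sketch of how such menu caching might look with the cache-aside pattern. The key format, TTL, and `fetch_from_db` callback are illustrative assumptions, not DoorDash's actual code; `cache` can be any client with Redis-style `get`/`setex` methods (e.g. `redis.Redis`).

```python
import json

def get_menu(cache, fetch_from_db, restaurant_id, ttl_seconds=300):
    """Cache-aside lookup: try the cache first, fall back to the database.

    `cache` is any client exposing Redis-style get/setex (e.g. redis.Redis);
    `fetch_from_db` loads the menu on a cache miss. Names are hypothetical.
    """
    key = f"menu:{restaurant_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)           # cache hit: skip the database
    menu = fetch_from_db(restaurant_id)     # cache miss: load from primary store
    cache.setex(key, ttl_seconds, json.dumps(menu))  # TTL expires stale menus
    return menu
```

The TTL bounds how stale a cached menu can get, which matters for data like prices and availability that changes often.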
Elasticsearch
Elasticsearch is a document database, meaning data is stored as JSON documents. Unlike Redis, Elasticsearch replicates each write to replica shards before acknowledging it, which ensures data remains available even when a node fails. Redis writes are acknowledged before they are replicated, leaving a small window in which data can be lost if a node fails. Elasticsearch (as the name suggests) is a good choice for search and analytics, and hence DoorDash uses it to search restaurants, menus, and customer reviews, and to run analytics on customer data.
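A restaurant search could be expressed as an Elasticsearch bool query like the sketch below. The index fields (`name`, `menu_items`, `cuisine`) are illustrative assumptions about the schema, not DoorDash's actual mapping; the function just builds the query body a client like `elasticsearch-py` would send.

```python
def restaurant_search_query(text, cuisine=None, size=10):
    """Build an Elasticsearch query body for restaurant search.

    Field names are hypothetical. `name^2` boosts matches on the restaurant
    name over matches inside menu items.
    """
    must = [{"multi_match": {"query": text, "fields": ["name^2", "menu_items"]}}]
    if cuisine:
        # Exact filter on a keyword field, combined with the text match.
        must.append({"term": {"cuisine": cuisine}})
    return {"size": size, "query": {"bool": {"must": must}}}
```

With a real cluster this body would be passed to `es.search(index=..., body=query)`.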
Snowflake
Snowflake is a highly scalable data warehouse designed for storing and analyzing large amounts of data (BI and analytics). It can be scaled horizontally (by adding more nodes) and vertically (by increasing the capacity of each node). Unlike Redis and Elasticsearch, Snowflake can ingest and process near-real-time data at warehouse scale, which is one of the main reasons DoorDash uses it to make routing decisions and optimize delivery routes.
DoorDash also uses other technologies to support their backend databases. Here are two of them:
Apache Kafka
Kafka is used for streaming data between different systems. DoorDash uses Kafka as a distributed platform for real-time event streaming, processing a variety of events:
Orders: Kafka is used to process order events, like when an order is placed, picked up, and delivered.
Payments: DoorDash uses Kafka to process payment events, such as a customer paying for an order or a subscription.
Customers: All events like user sign-ups, account creation, user profile updates, and customer placing an order are processed using Kafka.
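A sketch of what publishing one such order event might look like. The event schema and topic name are assumptions for illustration, not DoorDash's actual format; the commented producer snippet uses the `kafka-python` client.

```python
import json
import time
import uuid

def order_event(order_id, status):
    """Serialize an order lifecycle event (placed / picked_up / delivered).

    The schema here is hypothetical: an event id for deduplication, the order
    id as the logical key, and a millisecond timestamp.
    """
    return json.dumps({
        "event_id": str(uuid.uuid4()),
        "order_id": order_id,
        "status": status,
        "ts": int(time.time() * 1000),
    }).encode("utf-8")

# With a running broker, the event would be published roughly like this
# (topic name is illustrative); keying by order id keeps one order's
# events in sequence on a single partition:
#
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("order-events", key=b"order-42",
#                 value=order_event("order-42", "placed"))
```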
Docker
Docker is used to containerize applications, making them easier to manage and deploy. DoorDash creates Docker images for each of its databases; these images contain all the files and libraries needed to run the database. They also use Kubernetes as a container orchestration tool to manage these containerized applications, which helps ensure their database containers are running and available when needed.
DoorDash’s high volume of requests reflects the scale of the company’s success. As one of the leading food delivery services in the world, there is a constant need for infrastructural and organizational growth. To keep their data infrastructure reliable, they also use techniques like redundant systems (duplicated nodes or clusters) and disaster recovery plans.
The Big Change: Database Migration
DoorDash recently switched to Snowflake and CockroachDB, decisions driven by a number of factors. DoorDash claims to have observed performance and maintenance issues with its relational database (PostgreSQL): increased latency, slower SQL inserts (due to a single-writer, multiple-read-replica cluster topology), and CPU usage climbing above 80%. Additionally, the software teams discovered several anti-patterns (recurring design problems that put a system at risk of becoming highly counterproductive) that needed to be fixed to improve reliability and scalability; in some cases, these anti-patterns may have been the cause of plateaued system efficiency.
To address these anti-patterns, the engineering teams explored solutions like splitting large tables into multiple tables, and moving bulky payloads into AWS S3 blob storage while keeping only the S3 object URLs in the database tables.
Why CockroachDB?
The team decided to use CockroachDB to solve the scalability problem once and for all, while keeping a structured, tabular model based on SQL. They made the move because CockroachDB is built on a shared-nothing architecture (every node is independent, with its own local storage), making it resilient to failures and more scalable, since nodes can be added or removed without downtime or disruption.
Schema changes: The engineering teams decided to modify their schemas to leverage the underlying database's strengths. Though this was a tedious process, they believed updating the schemas would make the best use of their existing data.
Column families: They grouped columns into column families, since some data, like price and availability, changes far more often than the rest. With the frequently updated columns in their own family, an update touches only those columns instead of rewriting the full record, improving read/write speed. CockroachDB also supports the JSONB column type, which allows indexing fields and adding check constraints.
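In CockroachDB DDL, this grouping is expressed with `FAMILY` clauses in `CREATE TABLE`. The sketch below (table and column names are illustrative, not DoorDash's schema) keeps rarely changing columns in one family and the hot price/availability columns in another, so the parameterized `UPDATE` rewrites only the hot family's data:

```python
# DDL sketch: hot columns (price, available) get their own column family,
# so frequent updates avoid rewriting the static columns. Names are
# hypothetical; the statements would be executed via any Postgres-wire
# driver (e.g. psycopg2) connected to CockroachDB.
MENU_ITEMS_DDL = """
CREATE TABLE IF NOT EXISTS menu_items (
    id          UUID PRIMARY KEY,
    name        STRING,
    description STRING,
    price       DECIMAL,
    available   BOOL,
    FAMILY static_cols (id, name, description),
    FAMILY hot_cols (price, available)
);
"""

# Touching only columns in hot_cols rewrites just that family on disk:
UPDATE_PRICE = "UPDATE menu_items SET price = %s, available = %s WHERE id = %s"
```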
Deprecate SQL joins: SQL joins over huge tables can be resource-intensive, time-consuming, and expensive, especially on a cloud service. The teams replaced SQL joins with downstream service calls and simple SELECT statements, then joined the results in memory in application code.
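An in-memory join of two such service responses can be as simple as indexing one side by key. The record shapes below are assumptions for illustration; the pattern is to fetch each dataset with a simple SELECT or service call, then merge in code:

```python
def join_orders_with_restaurants(orders, restaurants):
    """Merge two independently fetched result sets in memory.

    `orders` and `restaurants` are lists of dicts, as returned by separate
    downstream service calls / simple SELECTs (hypothetical shapes).
    """
    # Index one side once: O(n + m) instead of a nested-loop O(n * m).
    by_id = {r["id"]: r for r in restaurants}
    return [
        {**order, "restaurant": by_id.get(order["restaurant_id"])}
        for order in orders
    ]
```

This trades a little application memory for removing the join from the database's critical path.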
Multi-row manipulation: To insert or update (UPSERT in CockroachDB, an insert-or-update statement) and delete multiple records, they use multi-row DML statements instead of multiple single-row statements, which significantly improves write speed.
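A multi-row UPSERT amortizes the per-statement round trip over many rows. The helper below (a sketch; table and column names are supplied by the caller) builds one parameterized statement in the `%s`-placeholder style used by Postgres-wire drivers like psycopg2, which also speak to CockroachDB:

```python
def build_multi_row_upsert(table, columns, row_count):
    """Build one parameterized UPSERT statement covering `row_count` rows.

    UPSERT is CockroachDB's insert-or-update statement; one multi-row
    statement replaces `row_count` network round trips. The caller passes
    the flattened parameter list to the driver alongside this SQL.
    """
    row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    placeholders = ", ".join(row for _ in range(row_count))
    return f"UPSERT INTO {table} ({', '.join(columns)}) VALUES {placeholders}"
```

For example, `build_multi_row_upsert("menu_items", ["id", "price"], 2)` yields `UPSERT INTO menu_items (id, price) VALUES (%s, %s), (%s, %s)`.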
The engineering teams first ran CockroachDB as a shadow database, with the legacy database as primary. To keep CockroachDB out of the critical flow until the final switch, they made a conscious choice to populate the new database with asynchronous calls, ensuring these writes never blocked the executing threads.
Over time, the engineering teams compared the outputs of the two databases and fixed any issues that caused discrepancies. Once they gained confidence that reads were consistent across both data stores, they flipped the primary database to CockroachDB and kept the legacy database as a shadow for rollback.
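The shadow-write idea above can be sketched with asyncio: the request waits only on the primary write, while the shadow write runs as a fire-and-forget task whose failures are swallowed (to be compared offline). The function names and stores are hypothetical; this is one way to keep the shadow off the critical path, not DoorDash's actual implementation.

```python
import asyncio

async def dual_write(primary_write, shadow_write, record):
    """Write to the primary synchronously; shadow the new database
    off the critical path so a slow or failing shadow never blocks
    the request. Returns the primary result and the shadow task."""
    result = await primary_write(record)       # caller waits only on this
    task = asyncio.create_task(_shadow(shadow_write, record))  # fire-and-forget
    return result, task

async def _shadow(shadow_write, record):
    try:
        await shadow_write(record)
    except Exception:
        # In a real system: log for offline comparison. Shadow failures
        # must never surface to the user-facing request.
        pass
```

The returned task handle lets tests (or a graceful shutdown path) await outstanding shadow writes without putting them on the hot path.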
DoorDash seized the food delivery market at its prime: it took strategic steps to establish its presence, expanded its user base, engineered its backend stack to scale, and ultimately grew from a Y Combinator startup into one of the leading food delivery companies in the world.