Original Post from Abnormal Security
Author: Michael Kralka
At Abnormal, the problems we are trying to solve are not that much different than those being tackled by other organizations, including non-startups. What is unique to startups are the additional constraints placed on the solution space, such as the amount of time/money/engineers that can be thrown at a problem. In this series, we will explore some of the challenges we face and the unique ways we have overcome these challenges.
Serving Batch Data
Approaching the Limit for Number of Supported Mailboxes
We recently were faced with the challenge of scaling our online KV (key-value, a.k.a., NoSQL) datasets, which had become one of the biggest limiting factors for us to take additional traffic from new customers. This was primarily driven by two datasets: Behavior DB and Mailbox DB (there is a third, Person DB, but that is a story for another day). The former, as its name suggests, is a dataset containing statistics about the behavior associated with all messages sent to or from an organization (and the individuals involved). The latter is a dataset containing metadata about each mailbox (i.e., email account). Both of these datasets grow linearly with the number of mailboxes being protected. Additionally, Behavior DB grows with the average number of data points generated per message.
We knew this was going to be an issue that we had to face but made the conscious decision to delay investing in a solution until the issue became more urgent. The issue became urgent when we only had enough runway to increase the number of mailboxes being protected by approximately 25% with a projected mailbox growth of 100% in the two months that followed.
How did we end up here?
Before we start, it would be helpful to have some shared context. If we zoom all of the way out, at the heart of the real-time component of our email protection product is the Real-Time Scorer (more affectionately known as the RT Scorer):
When an email message is delivered, it is first sent to the RT Scorer to generate a judgment (e.g., safe, suspicious, malicious, etc.) and is then sent along to the Post-Scoring Processor to take action related to the judgment (e.g., quarantining the message if it is malicious, injecting a banner in the body of the message if it’s suspicious, surfacing the threat in the customer portal if it is an unsafe message, etc.). The RT Scorer extracts metadata about the message (e.g., sender, recipient, email headers, etc.), collects data related to the entities of the message (e.g., sender, recipient, attachments, etc.), then passes that data on to the various detectors to provide a judgment. The RT Scorer is a stateless service running in a docker container. It scales horizontally with message volume by adding more instances. In theory, it should be infinitely scalable.
In the early days, Behavior DB and Mailbox DB were both small. We had a dozen or so instances processing double-digit queries per second (QPS) protecting approximately ten thousand mailboxes spread across a handful of customers. Behavior DB contained about 5 million keys, which fit nicely into a 100 MB database file. Mailbox DB was smaller and could easily fit into 20 MB. These datasets were (and still are) generated in batch (using Spark) and are updated daily.
Back then, we were faced with the difficult decision about how to make these datasets available to our detection system. Like many decisions, this one was about tradeoffs. In the startup life, time is precious and these decisions must be made quickly; there generally isn’t enough time to complete an in-depth evaluation of each viable option. Using familiar technologies (even though it may not be a perfect fit) and using them in simple ways brings velocity and a high chance of success. For the first version, we ultimately chose to keep the system simple and distribute the datasets to all RT Scorer servers so that they could access the data locally and not have to worry about the complexity of adding a remote procedure call into what was a self-contained, monolithic service. This simplicity and speed came at the cost of technical debt of a system that would not scale well. Behavior DB was deployed as a RocksDB database and was accessed via a library, directly off of disk. Mailbox DB was deployed as a compressed CSV file and was loaded into memory.
As our detection system became more sophisticated, the number of keys per message being materialized in the Behavior DB grew and so did its size. As the number of mailboxes we were protecting grew, so did the size of both datasets. By May 2020, both Behavior DB and Mailbox DB had grown by two orders of magnitude. The process of loading, unpacking, and opening these datasets was now taking more than 10 minutes and used a considerable portion of memory. Since the RT Scorer could not function without these datasets, this severely impacted our ability to launch new instances, affecting auto-scaling operations (it was sluggish), deploys (what used to take 10 minutes was taking over an hour), and S3 transfer costs (which were inflated). Mailbox DB would soon exhaust the available memory on the RT Scorer and to increase this (because of some fundamental technical details of how the RT Scorer is built) would mean running the servers with twice the memory and CPU (at twice the cost) without being able to utilize the additional compute power.
Assessing our options
We were faced with a problem. We were projecting the number of mailboxes to grow by a factor of five by the end of the year (which turned out to be spot on) and then by another factor of ten by the end of the following year. More urgently, the mailbox count was set to double in the next two months.
We faced the following challenges:
- A compressed time schedule (we had 3–4 weeks to roll out a solution)
- Limited resources available (we had to continue to support other development, leaving only 1–2 engineers to work on a solution)
Our solution must have met the following requirements:
- Support growth rate of the next year (a factor of 50)
- Support very large customers with hundreds of thousands of mailboxes
- Eliminate the long startup problem
- No (or negligible) impact on processing latency
Given these requirements and restrictions, we looked at our options. What follows are the most promising ones:
Option 1: Partition RT Scorer by customer and scale horizontally by adding more clusters as more customers are added
- Low engineering effort (customers are already partitioned to some extent so all the pieces are already in place)
- “Infinitely” Scalable (by adding partitions/clusters as customers are added)
- Low risk (because the RT Score is used as-is but deployed differently)
- Additional operational overhead of managing separate clusters. Partitioning is sticky (A customer can’t easily be move between clusters); need to assign/route each customer’s traffic to the right cluster, complicating the routing layer; and can lead to imbalanced partitioning if no corrective action is taken.
- Doesn’t address slow launch problem.
- Doesn’t support very large customers that can’t fit on a single partition.
Option 2: Scale the RT Scorer vertically by adding bigger disks, adding more memory, and caching datasets locally
- Solves the slow launch problem.
- No additional operational burden.
- Low risk
- Won’t support growth by a factor of 50 (limited to a factor of about 20)
- Data distribution continues to be a problem (How to efficiently distribute hundreds of GB?)
- Checks off all the boxes
- High risk (significant changes would be required to support a new data model)
- Remote access is more complicated than local access.
- Likely requires changing how data is generated.
- Not many off-the-shelf options for serving batch-generated KV data: Voldemort (seems abandoned and has known latency issues when updating datasets); Terrapin (was abandoned).
Option 3 (publishing data to a KV Store) seemed the most promising. However, it would introduce some risk.
- May require changes to how data is distributed (currently, data is pulled from S3).
- Would require substantial effort to validate data parity between new and old datasets.
- May require managing a new, complex server for which the team has limited experience.
Is there a way to get the best of both worlds? Could we take our existing RocksDB datasets and serve them from a centralized service? If so, this would eliminate much of the risk.
The approach we took was to use the current RocksDB dataset and serve it using rocksdb-server. rocksdb-server a simple service that exposes a Redis-compatible API on top of the RocksDB API. It serves a single database at a time and implements a tiny subset of the supported commands. This is a great solution for our use case because:
- We have experience with Redis and already have a Redis client and tools for accessing Redis servers.
- It uses the exact same dataset format, eliminating the risk associated with a new data format and any discrepancies that could creep in.
- rocksdb-server is very simple, further reducing risk by making it easier to debug, fix, and extend.
- Limited mostly by the size of disk that is available.
- Low latency (under load we achieved sub 1ms p99 latency)
With these new servers, the RT scorer diagram now looks like the following:
Introducing changes to any system comes with risk. In general, the more significant the change, the higher the risk. There are several patterns that can be followed to help de-risk any project. In this case, we chose the following strategies:
- De-risk the high-risk items as soon as possible
- Break down execution in Crawl/Walk/Run phases
- Launch in “dark mode” to validate the solution
De-risking the high-risk items
High-risk items can upturn an otherwise sound solution and waiting too long to figure this out can jeopardize its delivery. For this solution, there were a few things that we didn’t know or were worried about. In this particular case, most of our worries were around the rocksdb-server, since we had no experience with it. Specifically:
- Will it be performant enough?
- How many replicas will we need?
- What latency will we likely see?
- Does it support the functionality we need and/or can it be easily extended if it doesn’t?
The approach we took to de-risk this solution was to first create a docker image that would download an arbitrary dataset from S3 and launch rocksdb-server pointing to that dataset. Once this was done, it became possible to test all of the performance-related risks. More on this later.
Crawl/Walk/Run is an approach to delivering value that encourages early successes. It is an invaluable technique that demonstrated its value in this project. Thinking this way helps you focus on small improvements that deliver value without having to build the perfect solution right away. Build something small, learn, and improve over time. To solve our problem, all we have to do is crawl; we don’t need to run.
For the crawl, we were able to sacrifice some of the functionality that would be needed in the future but not needed now. E.g.:
- Horizontally Scalable as the datasets increases in size
Although we can horizontally scale to higher QPS by deploying more replicas of the data, we must vertically scale to machines with larger disks and RAM as the datasets increase in size. The Redis protocol supports sharded datasets, which rocksdb-server could be extended to support.
- Efficient distribution of datasets
We ended up productionalizing the proof-of-concept docker container that downloaded the datasets at launch because it was good enough for now. This can be addressed later when the datasets are too big and this is no longer a practical solution.
- Live dataset updates
Updating the version of data being served requires restarting the service (triggering a download of the latest version). Although this is a bit awkward, a load balancer can gracefully handle instances disappearing and it greatly simplifies the solution.
- Server-side metrics
On the server side, we are limited to system-level metrics (CPU and memory utilization) without any application metrics (e.g., latency, QPS, etc.). It would be great to have these, but would require more changes to rocksdb-server. The client side metrics, although far from perfect, can fill in a lot of the gaps here.
Sure, without this functionality our solution is very rough around the edges, but it’s solving a big problem.
Even with the most thorough testing, it’s hard to say with absolute certainty that an implementation will not break things. This is especially true when replacing one already-working-thing (that others are depending on) with something else. A common technique to mitigate this risk is the dark launch. In a dark launch, the new solution is exercised to its fullest but results are not used. Typically, it’s controlled by a knob that can send anywhere from 0% to 100% of traffic through the new solution.
In the case of Behavior DB and Mailbox DB, our dark launch consisted of the following functionality.
- Fetch data from the primary storage (ininitaly, this was local storage)
- Check to see if this request is selected for dark read, if not return the results from the primary storage.
- Asynchronously fetch data from the secondary storage (initially, this was remote storage)
- Return the results from the primary storage.
- When the results from the secondary storage becomes available, compare the results to that obtained from the primary storage and record the results.
This was controlled by dynamic configuration that indicated the percentage of requests that should be asynchronously fetched from secondary storage. A second flag controlled which percentage of requests used the remote storage as primary instead of the local storage. With these two flags we were able to slowly increase the traffic sent to the new rocksdb-server instances and observe how they performed as well as transition the remote storage from dark mode to live mode without having to redeploy any code.
By modifying these flags and looking at the comparison and latency metrics of the remote storage, we were able to be certain that the new remote storage provided byte-for-byte identical responses to the existing local storage and that the latency was within specification, all without any negative impact to production traffic.
We had a smooth rollout. All of the instrumentation we added was able to uncover issues that we were able to fix. Some of these fixes came in the form of improvements to rocksdb-server (which can be found on the scale-up branch of this fork).
We experienced two main issues during dark launch: excessive latency and discrepancies between old and new storage.
The source of the discrepancies was an interesting problem to track down (and worthy of being covered with greater depth in a follow-up post). In short, the problems that we encountered had two main causes:
- rocksdb-server and RT Scorer instances would update at different times, resulting in discrepancies in the comparisons until they became synchronized again, and
- A latent bug in how we had split customer traffic between two clusters was not consistent with how the data was split so some data was missing for some customers (with a 0.5% miss rate).
Tracking down latency issues was a similarly interesting problem to solve (also worthy of being covered in a follow-up post). Briefly, we tracked down the following sources of latency:
- Opening the database in read/write mode
In our use case, we are serving read-only data. The RocksDB library can open a database in read-only mode, which will disable a significant amount of locking, allowing for multiple threads to read at the same time without blocking. Additionally, we used this to reject any mutating request before even hitting the database.
- Accessing RocksDB from libuv event loop
libuv is an asynchronous I/O library that works on the premise of a single event loop thread. If a call into the RockDB library from the event loop thread blocks, it blocks all connections and serializes access to the database. Ideally, RocksDB library would use libuv to asynchronously read from disk. Since it does not, calling into the RocksDB library was moved to a thread pool.
- Writing diagnostic log events in libuv event loop
As we made changes, we added a lot of diagnostics to rocksdb-server. Similarly to accessing RockDB, this can stall the libuv event loop. A buffer was added and flushing was moved to use async I/O.
- Noisy Neighbors
We initially deployed these servers in shared general-purpose infrastructure. Although this infrastructure has good memory and CPU isolation, it does not have any disk isolation. Further, disk space was provided by network attached storage, which can have slower I/O times. In this environment, our services were often deployed on the same hardware as other containers with high disk I/O, our I/O suffered. To address this, we moved our service to dedicated resources with directly connected, fast SSD drives, which made a huge difference.
Overall, I think this was a very interesting and successful project. We were able to swap out the database from our application without any impact in production, reduced startup time from 10+ minutes to 30 seconds, and gave us enough runway to work on improving the “Crawl” and to focus on solving other problems.
Behavior DB and Mailbox DB have each grown 3x in size since we started working on our read-only KV Store server six months ago. Although they are still holding up, there are some short- and long-term improvements we would like to make.
- Improved Visibility
We use prometheus to monitor all of our services. We would like to add support for exporting metrics in this format, giving us server-side metrics.
- Live Updates
In order to reduce downtime and more efficiently distribute the datasets to the nodes, we’d like to introduce a sidecar that manages the data on disk and notifies the main service when more data is available.
- Data Sharding
We would leverage Redis’ cluster extensions to support splitting data across multiple shards to serve very large datasets.
- Other Storage Formats
RocksDB data format is great for read/write datasets but is suboptimal for read-only datasets. This will soon become a bottleneck for generating our datasets in batch. Support for a file format that is MapReduce friendly (such as Voldemort’s read-only format) would improve this greatly.
Beyond improvements to how we serve batch data, there is still a ton more to do to help scale the business! We need to figure out how to efficiently scale the RT Scorer 10x and beyond. As a force multiplier for the organization, the infrastructure/platform team needs to figure what common design patterns to provide to engineers (e.g., data processing, storage solutions, etc) and how to better support a growing engineering team by making it easier for engineers to build, test, deploy, and monitor low-latency, high-throughput services so engineers can spend more time solving customer problems and less time deploying and maintaining services. If these problems interest you, yes, we’re hiring!
Go to Source
Author: Michael Kralka