Resilient Engineering: Designing Your System to Survive Pressure

PandaDoc serves more than 50,000 customers each day, and we bear a huge responsibility to provide every one of our users with reliable service. Our application consists of more than 100 services, plus dozens of front-ends developed throughout the last ten years. When working with a system this complex, an engineer can't track every component manually — this is simply beyond any human's limit.

So how do we work around this cognitive limitation to design and reliably operate a system of such complexity? To ensure PandaDoc's components are resilient to unexpected issues such as load spikes and infrastructure failures, we use a set of engineering practices and design patterns called “resilient engineering.” This article describes some of them, across the entire scale of software design and engineering, from a single SQL query to whole-system architecture.

Explicit limits in SQL queries

How much harm can an SQL query like this do?

SELECT id, name, date FROM tbl WHERE parent_id = 5;

The answer depends on the amount of data in this table, as well as on the hardware running the database and its current load. This query may run perfectly fine on your development laptop, within your staging environment, or even within your test suite, where the dataset is small.

But in an actual production environment, where the table may contain millions of rows matching the “WHERE” condition, this query may return millions of rows. Simply pulling them from disk and putting them into the database cache can be challenging enough. Transferring them over the network into your application process and placing them in memory can then exhaust that memory. Each of these actions presents a risk that could adversely affect your application and disrupt your service.

To make this query safe, make one small change:

SELECT id, name, date FROM tbl WHERE parent_id = 5 LIMIT 10;

Adding this “LIMIT” clause places an upper bound on the number of rows the database will look for and return. So in this example, no more than 10 rows can be returned. The exact limit depends on the use case at hand. For example, 10 may be a reasonable number if your interface displays 10 items per page.

Sometimes people even add a “LIMIT 1” clause to queries that are expected to return just one row by primary key. You could call this a bit paranoid and skip it. On the other hand, if you imagine that the column you look up by could someday stop being the primary key, it might be better to be safe than sorry.

Another idea that takes this practice even further is to add an automated check that writes a log notice whenever you run a query without an explicit “LIMIT” clause. This way, you can be sure you won't miss one.
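To make that concrete, here is a minimal sketch of such a check in Python. It assumes a DB-API style cursor (sqlite3 is used only to keep the example runnable), and the LimitCheckingCursor wrapper is hypothetical rather than our actual tooling:

    import logging
    import re
    import sqlite3

    logger = logging.getLogger("sql.limit_check")

    class LimitCheckingCursor:
        """Wraps a DB-API cursor and logs SELECT queries without an explicit LIMIT."""

        def __init__(self, cursor):
            self._cursor = cursor

        def execute(self, sql, params=()):
            normalized = sql.strip().lower()
            if normalized.startswith("select") and not re.search(r"\blimit\b", normalized):
                logger.warning("SELECT without explicit LIMIT: %s", sql)
            return self._cursor.execute(sql, params)

        def __getattr__(self, name):
            # Delegate fetchone, fetchall, etc. to the wrapped cursor.
            return getattr(self._cursor, name)

    logging.basicConfig(level=logging.WARNING)
    conn = sqlite3.connect(":memory:")
    cur = LimitCheckingCursor(conn.cursor())
    cur.execute("CREATE TABLE tbl (id INTEGER, name TEXT, date TEXT, parent_id INTEGER)")
    cur.execute("SELECT id, name, date FROM tbl WHERE parent_id = ?", (5,))          # logged
    cur.execute("SELECT id, name, date FROM tbl WHERE parent_id = ? LIMIT 10", (5,)) # quiet

In practice, such a check could live in your ORM's query hooks or in a CI lint step; the point is simply to make a missing “LIMIT” visible before it reaches production.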


Naval engineering meets software engineering

This pattern is inspired by naval engineering.

Ships and boats have bulkheads that divide the hull into separate watertight compartments, so that water from a hull breach stays contained in one compartment. This gives the entire vessel a better chance of surviving such a breach, since the other compartments remain dry and keep providing buoyancy.

Now for an example of how we’ve applied this pattern at PandaDoc.

Our quote service used to be deployed as a single cluster handling two types of workloads: fast synchronous RPCs for CRUD-like operations, and asynchronous processing of bulk updates. Whenever the asynchronous workload got a one-off spike in load, the CPUs serving both workloads would max out and the RPC calls suffered. The unwanted result was our customers experiencing delays in quote creation.

To resolve this issue, we split the quote service cluster into two separate compartments, one for each workload. With this revised setup, which takes its inspiration from a vessel's bulkheads, issues in one workload no longer affect the other.

Our general idea has been to identify which functions of the system can affect each other, then make them as independent from one another as possible while prioritizing the most critical ones.

In the example above, a user's ability to quickly create a quote is time-critical. After all, this is one of the main workflows of our application. On the other hand, our users don't expect background updates to finish instantly, and our interface explicitly communicates that they may take some time. Keeping this in mind, we prioritized the speed and stability of quote creation over the speed of background updates, and accepted the additional infrastructure complexity of running two clusters instead of one.
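The same idea works at a much smaller scale, too. Here is an illustrative Python sketch of an in-process bulkhead: two separate thread pools, so a burst of background work cannot starve the pool that serves interactive requests. The names and pool sizes are made up for the example and are not our real configuration:

    import time
    from concurrent.futures import ThreadPoolExecutor

    # Each workload gets its own, independently sized pool: a spike in bulk jobs
    # can only exhaust bulk_pool, never the threads that serve interactive calls.
    rpc_pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="rpc")
    bulk_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="bulk")

    def create_quote(quote_id: int) -> str:
        time.sleep(0.01)  # stand-in for a fast CRUD-like operation
        return f"quote {quote_id} created"

    def bulk_update(batch: range) -> int:
        time.sleep(1)  # stand-in for heavy batch processing
        return len(batch)

    # A burst of background batches only queues up inside bulk_pool ...
    for start in range(0, 10_000, 1_000):
        bulk_pool.submit(bulk_update, range(start, start + 1_000))

    # ... while interactive calls are still served promptly from rpc_pool.
    print(rpc_pool.submit(create_quote, 42).result())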

Graceful degradation

When you work with a distributed system, chances are high that some component will fail at any given moment, simply because of the number of separate components. At PandaDoc, we acknowledge this by planning for it: the design of communication between components should take failure modes into account.

So, how should a system behave if a component fails?

The “graceful degradation” approach advocates for being prepared for a major component’s failure, then having a fallback plan in place that will allow users to continue their work, even with limited functionality.

One example at PandaDoc is our Add-On Store, which allows users to manage their own add-ons. This service stores the state of all enabled add-ons across accounts and users. If the Add-On Store goes down, users will be unable to manage their add-ons. However, our main application will remain intact because we store a cached version of the enabled add-ons’ state outside the service.
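In code, the fallback can be as simple as catching the failure and serving the last known state. The sketch below is purely illustrative: the function and cache names are hypothetical, and the real service uses a different API.

    import logging

    logger = logging.getLogger("addons")

    class AddOnStoreUnavailable(Exception):
        pass

    # Hypothetical local cache of the last successfully fetched state per account.
    _cache: dict = {}

    def fetch_enabled_addons_from_service(account_id: str) -> list:
        # Stand-in for the real remote call, which can time out or fail outright.
        raise AddOnStoreUnavailable("add-on store is down")

    def get_enabled_addons(account_id: str) -> list:
        """Return enabled add-ons, degrading to the cached state if the service fails."""
        try:
            addons = fetch_enabled_addons_from_service(account_id)
            _cache[account_id] = addons
            return addons
        except AddOnStoreUnavailable:
            if account_id in _cache:
                logger.warning("Add-On Store unavailable, serving cached state for %s", account_id)
                return _cache[account_id]
            # No cached state either: degrade to "no add-ons" rather than failing the page.
            logger.warning("Add-On Store unavailable and no cached state for %s", account_id)
            return []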

Another example is how we handle document editor components such as the quote block. If our quote service becomes unresponsive, our document editor renders a placeholder for the quote block while keeping all other blocks available for editing.
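The rendering side follows the same shape: isolate each block's failure instead of letting it break the whole document. This rough Python sketch is only an analogy for the idea; our actual editor is a front-end application with a different structure.

    def render_quote_block(block_id: str) -> str:
        # Stand-in for a call to the quote service; may raise if it is unresponsive.
        raise TimeoutError("quote service did not respond")

    def render_block(block: dict) -> str:
        """Render one block, substituting a placeholder if its backing service fails."""
        try:
            if block["type"] == "quote":
                return render_quote_block(block["id"])
            return f"<div>{block['content']}</div>"
        except Exception:
            # Degrade just this block; the rest of the document stays editable.
            return "<div class='placeholder'>This block is temporarily unavailable.</div>"

    document = [
        {"type": "text", "id": "1", "content": "Hello"},
        {"type": "quote", "id": "2", "content": ""},
    ]
    print("".join(render_block(b) for b in document))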


Preventive chaos

Often, you learn about resilience shortcomings and overly tight coupling between system components the hard way: one minor element goes down, and a chain of cascading failures takes the whole system down with it.

This is a passive approach, and we prefer one that’s much more proactive.

To uncover failure modes within our system, we've created a step-by-step process:

  1. Deploy the whole system to a staging environment
  2. Disable one service or infrastructure component (e.g., a database or message broker)
  3. Run a full suite of end-to-end tests
  4. Review failed tests one by one, then come up with actions to mitigate the failure
  5. Repeat for every service/component

After running this process, we usually end up with a list of tasks: adding missing timeouts for remote calls, adding circuit breakers and dead letter queues, or redesigning part of the interface to allow graceful degradation. We put these at the top of the backlog and prioritize addressing them.
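As an example of what those mitigations can look like, here is a minimal, illustrative sketch of a remote call with an explicit timeout wrapped in a simple circuit breaker. In production you would typically reach for a battle-tested library rather than hand-rolling this; the URL and thresholds below are placeholders:

    import time
    import urllib.request

    class CircuitBreaker:
        """Fails fast after repeated errors instead of hammering a broken dependency."""

        def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0):
            self.failure_threshold = failure_threshold
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = None  # timestamp when the circuit was opened

        def call(self, func, *args, **kwargs):
            if self.opened_at is not None:
                if time.time() - self.opened_at < self.reset_after:
                    raise RuntimeError("circuit open: failing fast without a remote call")
                self.opened_at = None  # half-open: allow one trial call through
            try:
                result = func(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.time()
                raise
            self.failures = 0
            return result

    def fetch_health(url: str) -> int:
        # The explicit timeout keeps one slow dependency from tying up our workers.
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status

    breaker = CircuitBreaker()
    # status = breaker.call(fetch_health, "https://example.com/health")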

This approach is super-scrappy and eliminates the need for fancy tools like Netflix's Chaos Monkey. The only prerequisite to getting started is having a staging environment and a reasonable end-to-end test suite. The trade-off is that setting up the environment and analyzing test results is a manual job that takes time. Our experience is that one engineer can run this flow for 4-5 services in a single day, so if you have hundreds of services at hand, you’ll be able to do this only once in a while.

written by
Mick Amelishko
Senior Director of Engineering at PandaDoc. I’ve worked in tech for 18 years and started as a do-it-all Web engineer before co-founding a startup. I later joined PandaDoc as an early employee, and in the years since, I’ve helped create and shape the company’s engineering culture while growing our team more than 25x.
illustrated by
Marharyta Heokchian
edited by
Charles the Panda
photos and pictures by
PandaDoc and public sources