"Reliability" by Joao Cavalheiro

Reliability Engineering at PandaDoc

This article will give you a glimpse into how PandaDoc manages reliability and how we improve it continuously.

A large-scale distributed application has its infrastructure hosted in the cloud and involves multiple programming languages, databases, message brokers, cache servers, other open-source or third-party vendor systems, and many services interacting in complex chains of logic. Minor disruptions in one service can be magnified and have serious side effects on the whole platform.

When a problem happens, knowing that users are impacted and that the business could be losing money and trust makes every second count. Time goes by very quickly, and stress levels can be high.

100% uptime would be great in a perfect world, but there is no such thing.

The importance of reliability for PandaDoc

PandaDoc is mission-critical software for 50,000+ customers that use it for internal document processes such as contracts, proposals, and notarization, as well as other HR, legal, and sales use cases. It is crucial for PandaDoc to ensure reliability and performance, even in the face of unexpected events or disruptions.

The best-selling author Nassim Nicholas Taleb coined the term “antifragile” to describe the capability of a system to thrive even in the face of stressors, shocks, volatility, noise, or mistakes.

Given the unpredictable nature of events that our system may face, it's impossible to anticipate them all. Therefore, it's crucial to follow a design approach that takes into account potential failures, and to ensure the system has the necessary resilience mechanisms to handle such conditions. Moreover, we want to improve reliability without affecting the cost of development or the speed at which we can deliver service to our customers.

Key principles:

  • Automation: testing in CI/CD pipelines, automated infrastructure provisioning and control via infrastructure as code, and automated monitoring and alerting for when bad things happen (a small alerting sketch follows this list).
  • Ownership: We have a pre-production checklist and release notes. Every release at PandaDoc has a risk score and an emergency plan/rollback procedure. The team releasing the changes is responsible for assigning a release engineer and actively monitoring the release.
  • Post-mortems: We take it very seriously! Engineers own post-mortems and use them as a learning process. A post-mortem has priority over any other task. We share incident situations with everyone internally to avoid making the same mistake twice.
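
As a concrete (and deliberately simplified) illustration of the alerting part of that automation, here is a minimal sketch of a threshold-based alert check. The helpers query_error_rate and notify_on_call are hypothetical stand-ins for a real metrics backend and paging integration - this is not our production code.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    service: str
    max_error_rate: float  # e.g. 0.01 means 1% of requests may fail


def query_error_rate(service: str) -> float:
    """Hypothetical helper: fetch errors / total requests for the last 5 minutes."""
    raise NotImplementedError("stand-in for a real metrics backend query")


def notify_on_call(message: str) -> None:
    """Hypothetical helper: page the on-call engineer (chat, paging tool, etc.)."""
    raise NotImplementedError("stand-in for a real paging integration")


def evaluate(rule: AlertRule) -> None:
    # Compare the observed error rate against the rule's threshold and page if exceeded.
    error_rate = query_error_rate(rule.service)
    if error_rate > rule.max_error_rate:
        notify_on_call(
            f"{rule.service}: error rate {error_rate:.2%} "
            f"exceeds threshold {rule.max_error_rate:.2%}"
        )
```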

Visibility into what is going on

We can only manage what we measure. At PandaDoc, we have a service-oriented landscape that involves many interconnected services managed by relatively small teams. Whenever a problem happens, being able to quickly figure out what part of our stack is in trouble, who to contact, and a complete diagnosis context is essential.

Is customer experience degrading? What is the application's throughput, latency, etc.? Is my infrastructure optimized? Are we spending too much money? Can we deploy changes quickly?

Observability is the practice of instrumenting a system to generate data as it operates, providing insight into what it’s doing. When you tie together metrics, traces, and logs, you can tell a detailed story about what is happening in your services.

The metrics set the scene. Is a particular service experiencing a high error rate? The traces give you the timeline and all the participating services and some clues through tags (e.g., user, request ID, document ID) that can tell you the flow between participating services. Finally, logs give you detailed insights into a particular service. When you tie everything together, you can connect the dots.
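
To illustrate how the three signals come together in code, here is a minimal sketch using the OpenTelemetry Python API (chosen for the example - it is not necessarily the stack we run; the service, span, and attribute names are made up, and the provider/exporter configuration that ships data to a backend is omitted):

```python
import logging

from opentelemetry import metrics, trace

tracer = trace.get_tracer("documents-service")
meter = metrics.get_meter("documents-service")
request_counter = meter.create_counter(
    "documents.requests", description="Total document requests"
)
logger = logging.getLogger("documents-service")


def send_document(document_id: str, user_id: str) -> None:
    # The trace span carries tags (user, document ID) that let you follow the
    # request across participating services; the metric sets the scene, and
    # the log line gives the detailed, service-local view.
    with tracer.start_as_current_span("send_document") as span:
        span.set_attribute("document.id", document_id)
        span.set_attribute("user.id", user_id)
        request_counter.add(1, {"operation": "send"})
        logger.info("sending document %s for user %s", document_id, user_id)
        # ... actual business logic would go here ...
```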

Service Level Objectives help measure the reliability of a service from the user’s point of view. For example, take the case where service downtime is a second-order effect of high CPU: users don’t care about your service’s CPU usage - they care about how your product performs and feels. We track indicators such as the following (a short computation sketch follows the list):

  • Availability: The uptime of a service. We expect all our critical flows to be available over 99.9% of the time.
  • Processing time: The time it takes for a service to respond to a request. This metric is often used to measure the responsiveness of PandaDoc.
  • Error Rate: The number of errors that occur when using a service. It’s typically measured as the ratio of errors to the total number of requests and is a proxy metric for customer pain.
  • Throughput: The number of requests a service can handle per second. This metric is often used to measure a service's scalability and trigger actions such as auto-scaling.  
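
Here is a minimal sketch of how such indicators can be computed from raw request counts over a measurement window. It is illustrative only - not our actual tooling - and it approximates availability as the request success rate:

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    total_requests: int
    failed_requests: int
    total_latency_ms: float   # sum of response times in the window
    window_seconds: float


def error_rate(stats: WindowStats) -> float:
    """Errors as a fraction of all requests - a proxy for customer pain."""
    return stats.failed_requests / stats.total_requests if stats.total_requests else 0.0


def availability(stats: WindowStats) -> float:
    """Share of requests served successfully, compared against the 99.9% objective."""
    return 1.0 - error_rate(stats)


def average_processing_time_ms(stats: WindowStats) -> float:
    return stats.total_latency_ms / stats.total_requests if stats.total_requests else 0.0


def throughput_rps(stats: WindowStats) -> float:
    return stats.total_requests / stats.window_seconds


# Hypothetical one-hour window: 120,000 requests, 36 failures, 120 ms average latency.
stats = WindowStats(total_requests=120_000, failed_requests=36,
                    total_latency_ms=14_400_000, window_seconds=3_600)
assert availability(stats) >= 0.999   # this window meets the 99.9% objective
```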

Additionally, we use DORA metrics - which are leading indicators for the performance of our development team. Those cover how often we release to production, the average time it takes for a change to reach production, the percentage of deployments that caused a failure, and how quickly we can recover from a problem.
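
As a rough illustration (the record fields are hypothetical and this is not our internal tooling), the four DORA metrics can be derived from a list of deployment records like this:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                   # when the change was merged
    deployed_at: datetime                    # when it reached production
    caused_failure: bool = False
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def _average(deltas: list[timedelta]) -> Optional[timedelta]:
    return sum(deltas, timedelta()) / len(deltas) if deltas else None


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    failures = [d for d in deployments if d.caused_failure]
    return {
        # How often we release to production.
        "deployment_frequency_per_day": len(deployments) / period_days,
        # Average time for a change to reach production.
        "lead_time_for_changes": _average([d.deployed_at - d.committed_at for d in deployments]),
        # Percentage of deployments that caused a failure.
        "change_failure_rate": len(failures) / len(deployments) if deployments else 0.0,
        # How quickly we recover from a problem.
        "time_to_restore": _average([d.restored_at - d.deployed_at for d in failures if d.restored_at]),
    }
```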

We accept errors

As a growing company, we look for ways to balance innovation and downtime/errors. For that purpose, we have adopted error budgets. This framework helps to create a shared understanding of the trade-offs between reliability and other priorities, such as feature development. 

By establishing a clear threshold for acceptable levels of errors, we can make informed decisions about when to prioritize reliability improvements over other work. Additionally, we maintain a heatmap that places our services in quadrants according to each service's criticality for the business and its health, covering code complexity, known tech debt, incident history, and test coverage.
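
The arithmetic behind an error budget is straightforward. For example, a 99.9% availability objective over a 30-day window leaves roughly 43 minutes of tolerated downtime; the sketch below, with hypothetical figures, shows how consumption of that budget can be tracked:

```python
# Error-budget arithmetic for a 99.9% availability objective over 30 days.
SLO = 0.999
WINDOW_MINUTES = 30 * 24 * 60                  # 43,200 minutes in the window

budget_minutes = (1 - SLO) * WINDOW_MINUTES    # 43.2 minutes of tolerated downtime

# Hypothetical figure: incidents have already consumed 30 minutes this window,
# so roughly 31% of the budget remains - a signal to favor reliability work
# over new features before the budget runs out.
consumed_minutes = 30.0
remaining_ratio = 1 - consumed_minutes / budget_minutes
print(f"Error budget: {budget_minutes:.1f} min, remaining: {remaining_ratio:.0%}")
```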

Conclusion

Ownership is essential: Service owners need to plan for and accommodate service outages, disruptions to known and unknown dependencies, and sudden unexpected loads throughout the system. To achieve the deep knowledge required to operate a service at scale, engineers must feel responsible for what they build and have a long-term commitment to maintaining it.

Observability is a continuous improvement process: Having good tools is not enough - the quality of your data matters the most: metrics have to be clear and precise, logs have to be informative, and traces have to work. At one point, we were generating over 10 TB of logs and metrics data every month - which, even given the large size of our infrastructure, was a sign that there was too much noise and we could be missing important information. Always keep an eye on your signal-to-noise ratio.

Automate everything: We’ve made efforts to unify and standardize our tooling, have fully automated deployments, and follow a healthy test automation pyramid. Slowing down to find all potential problems before every single release wasn’t the answer - if we do so, we sacrifice velocity and increase WIP (work in progress). Moreover, automation reduces human error and toil so that engineers can focus on what matters: making PandaDoc a product chosen by 50,000+ customers.

Designing for failure: Reliability starts during the design process. During our system / technical architecture process (SA/TA), we review the concept of every new service or big change, including how it will fit into the current landscape, how it will scale, its failure modes, and graceful degradation - the ability of a service to tolerate failures with the least impact on the end user.
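
As a small, hypothetical illustration of graceful degradation: when a non-critical dependency fails, a service can fall back to cached or default data instead of failing the whole request. The service and function names below are made up for the example.

```python
import logging

logger = logging.getLogger(__name__)


def fetch_activity_feed(document_id: str) -> list[str]:
    """Hypothetical call to a non-critical downstream service."""
    raise TimeoutError("activity-feed service unavailable")  # simulate an outage


def activity_feed_with_fallback(document_id: str, cache: dict[str, list[str]]) -> list[str]:
    try:
        return fetch_activity_feed(document_id)
    except Exception:
        # Degrade gracefully: record the failure for observability, then serve
        # stale cached data (or nothing) so the core document flow keeps working.
        logger.warning("activity feed unavailable; serving cached data", exc_info=True)
        return cache.get(document_id, [])
```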

written by
João Cavalheiro
Director of Engineering, leading the Application Platform track at PandaDoc. I’m responsible for our Infrastructure and Technology layer and have over 20 years of experience in the software industry.
illustrated by
Marharyta Heokchian
edited by
Tamiracle Williams
photos and pictures by
PandaDoc and public sources
© 2022 by PandaDoc