This article will give you a glimpse into how PandaDoc manages reliability and how we improve it continuously.
In a large-scale distributed application, infrastructure is hosted in the cloud; multiple programming languages are in play, along with databases, message brokers, cache servers, and other open-source or third-party vendor systems, and many services interact in complex chains of logic. A minor disruption in one service can be magnified and have serious side effects on the whole platform.
When a problem happens, knowing that users are impacted and that the business could be losing money and trust makes every tick of the clock count. Time goes by very quickly, and stress levels can be high.
100% uptime would be great in a perfect world, but there is no such thing.
PandaDoc is mission-critical software for 50,000+ customers that use it for internal document processes such as contracts, proposals, and notarization, as well as other HR, legal, and sales use cases. It is crucial for PandaDoc to ensure reliability and performance, even in the face of unexpected events or disruptions.
In recent years, the best-selling author Nassim Nicholas Taleb coined the term “antifragile” to describe the capability of a system to thrive even in the face of stressors, shocks, volatility, noise, or mistakes.
Given the unpredictable nature of the events our system may face, it's impossible to anticipate them all. Therefore, it's crucial to follow a design approach that accounts for potential failures and to ensure the system has the resilience mechanisms needed to handle such conditions. Moreover, we want to improve reliability without increasing the cost of development or slowing the speed at which we deliver for our customers.
Key principles:
We can only manage what we measure. At PandaDoc, we have a service-oriented landscape with many interconnected services managed by relatively small teams. Whenever a problem happens, it is essential to quickly figure out which part of our stack is in trouble and who to contact, and to have complete diagnostic context at hand.
Is customer experience degrading? What is the application's throughput, latency, etc.? Is my infrastructure optimized? Are we spending too much money? Can we deploy changes quickly?
Observability is the practice of instrumenting a system to generate data as it operates to provide insight into what it’s doing. When you tie together metrics, traces, and logs, you can tell a detailed story about what is happening in your services.
The metrics set the scene: is a particular service experiencing a high error rate? The traces give you the timeline, the participating services, and clues through tags (e.g., user, request ID, document ID) that reveal the flow between those services. Finally, logs give you detailed insight into a particular service. When you tie everything together, you can connect the dots.
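To make that concrete, here is a minimal, illustrative sketch in Python - not PandaDoc's actual tooling - of the underlying idea: a request ID generated at the edge is attached to every log line and recorded alongside a latency metric, so the three signals can be joined later. Names such as handle_request and the documents-service logger are invented for the example.

```python
import logging
import time
import uuid
from contextvars import ContextVar

# Shared request ID: in a real system this would be propagated across services
# via request headers and picked up by the tracing library.
request_id: ContextVar[str] = ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Inject the current request ID into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id.get()
        return True


logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s request_id=%(request_id)s %(message)s",
)
logger = logging.getLogger("documents-service")  # hypothetical service name
logger.addFilter(RequestIdFilter())

# A crude in-process "metric": request latencies per endpoint, in milliseconds.
latencies_ms: dict[str, list[float]] = {}


def handle_request(endpoint: str) -> None:
    rid = str(uuid.uuid4())  # the same ID would travel downstream, forming the trace
    request_id.set(rid)
    start = time.perf_counter()
    logger.info("started %s", endpoint)
    # ... call downstream services here, passing `rid` in request headers ...
    elapsed = (time.perf_counter() - start) * 1000
    latencies_ms.setdefault(endpoint, []).append(elapsed)  # the metric
    logger.info("finished %s in %.1f ms", endpoint, elapsed)


handle_request("/documents/send")
```

A real setup uses a metrics backend and a distributed-tracing library instead of an in-memory dict, but the joining key - a shared ID across logs, metrics, and traces - is the same idea.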
Service Level Objectives (SLOs) help measure the reliability of a service from the user's perspective. Take the case where service downtime is a second-order effect of high CPU usage: users don't care about your service's CPU - they care about how your product performs and feels.
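As an illustration (the numbers and the 99.9% target below are hypothetical, not our actual SLOs), the SLI is computed from what users experience - successful requests - and compared against the objective:

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded from the user's point of view."""
    return good_requests / total_requests


slo_target = 0.999  # e.g., "99.9% of requests succeed" over a 30-day window
sli = availability_sli(good_requests=9_991_230, total_requests=10_000_000)
print(f"SLI = {sli:.4%}, SLO met: {sli >= slo_target}")
```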
Additionally, we use DORA metrics, which are leading indicators of our development teams' performance. They cover how often we release to production, the average time it takes for a change to reach production, the percentage of deployments that cause a failure, and how quickly we can recover from a problem.
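Conceptually, each of the four DORA metrics is a simple aggregation over deployment records. The sketch below uses made-up data and field names purely to show the arithmetic:

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical records: (commit_time, deploy_time, caused_incident, time_to_restore)
deployments = [
    (datetime(2023, 5, 1, 9), datetime(2023, 5, 1, 14), False, None),
    (datetime(2023, 5, 2, 10), datetime(2023, 5, 2, 12), True, timedelta(minutes=35)),
    (datetime(2023, 5, 3, 11), datetime(2023, 5, 3, 13), False, None),
]

window_days = 7
deployment_frequency = len(deployments) / window_days  # deploys per day
lead_time_hours = mean((deploy - commit).total_seconds() / 3600  # commit-to-production time
                       for commit, deploy, _, _ in deployments)
change_failure_rate = sum(failed for _, _, failed, _ in deployments) / len(deployments)
mttr_minutes = mean(restore.total_seconds() / 60  # time to restore service after a failure
                    for _, _, failed, restore in deployments if failed)

print(deployment_frequency, lead_time_hours, change_failure_rate, mttr_minutes)
```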
As a growing company, we look for ways to balance innovation and downtime/errors. For that purpose, we have adopted error budgets. This framework helps to create a shared understanding of the trade-offs between reliability and other priorities, such as feature development.
By establishing a clear threshold for acceptable levels of errors, we can make informed decisions about when to prioritize reliability improvements over other work. Additionally, we maintain a heatmap that places our services in quadrants by business criticality and overall health, covering code complexity, known tech debt, incident history, and test coverage.
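A minimal sketch of the error-budget arithmetic, again with a hypothetical 99.9% availability SLO and invented thresholds for when to slow down:

```python
slo_target = 0.999
total_requests = 10_000_000  # hypothetical traffic for the 30-day window
failed_requests = 6_200      # hypothetical failures observed so far

budget_total = (1 - slo_target) * total_requests  # failures we can "afford": 10,000
budget_spent = failed_requests / budget_total     # fraction of the budget consumed

if budget_spent >= 1.0:
    print("Error budget exhausted: freeze risky releases, prioritize reliability work")
elif budget_spent >= 0.75:
    print(f"{budget_spent:.0%} of the error budget spent: slow down and review")
else:
    print(f"{budget_spent:.0%} spent: keep shipping features")
```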
Ownership is essential: Service owners need to plan for and accommodate service outages, disruptions to known and unknown dependencies, and sudden unexpected loads throughout the system. To achieve the deep knowledge required to operate a service at scale, engineers must feel responsible for what they build and have a long-term commitment to maintaining it.
Observability is a continuous improvement process: Having good tools is not enough - the quality of your data matters most: metrics have to be clear and precise, logs have to be informative, and traces have to work. At one point, we were generating over 10 TB of log and metric data every month - which, even given the large size of our infrastructure, was a sign that there was too much noise and we could be missing important information. Always keep an eye on your signal-to-noise ratio.
Automate everything: We’ve made efforts to unify and standardize our tooling, have fully automated deployments, and follow a healthy test automation pyramid. Slowing down to find every potential problem before each release wasn’t the answer - doing so would sacrifice velocity and increase WIP (work in progress). Moreover, automation reduces human error and toil so that engineers can focus on what matters: making PandaDoc a product chosen by 50,000+ customers.
Designing for failure: Reliability starts during the design process. During our system / technical architecture process (SA/TA), we review the concept of every new service or big change: how it will fit into the current landscape, how it will scale, its failure modes, and graceful degradation - the ability of a service to tolerate failures with the least possible impact on the end user.
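To illustrate graceful degradation (the service and function names here are invented, not part of our codebase): when a non-critical dependency fails, the caller serves a degraded but still useful response instead of failing the whole request.

```python
import logging

logger = logging.getLogger("editor-service")  # hypothetical service


def fetch_recommendations(document_id: str) -> list[str]:
    """Call a hypothetical, non-critical recommendations service."""
    # Simulate a failing or slow dependency.
    raise TimeoutError("recommendations service did not respond in 200 ms")


def load_editor_page(document_id: str) -> dict:
    try:
        recommendations = fetch_recommendations(document_id)
    except Exception:
        # Degrade gracefully: the document still opens, just without suggestions.
        logger.warning("recommendations unavailable for %s, serving without them", document_id)
        recommendations = []
    return {"document_id": document_id, "recommendations": recommendations}


print(load_editor_page("doc-123"))
```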