"Quality management" by Artsiom Strok

How PandaDoc Approaches Quality Management of Over 200 Microservices

Quality is an essential component of product development that can have a significant impact on both product success and user satisfaction. At PandaDoc, our team’s size and complexity have expanded rapidly over the last few years, and in that time we’ve added over 50 microservices. It hasn’t always been easy to manage technical quality while scaling that quickly, but we’ve overcome some big challenges. Here’s how we’ve done it.

Do things that don’t scale

Customers are our top priority at PandaDoc, and our app is a mission-critical product for them. If we deliver substandard quality, we risk damaging our reputation and eroding customer trust. Our Engineering team understands how crucial our product’s quality and reliability are, so we’re always looking for ways to improve.

One of the solutions we’ve come up with has been to introduce the Ready For Maintenance (RFM) standard, which mandates specific and measurable requirements that every component of the system — whether a microservice or a library — needs to meet.
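To make the standard concrete, here is a minimal sketch of what a machine-readable RFM checklist for a single component could look like. The attribute names and thresholds are illustrative assumptions based on the checks described later in this post, not PandaDoc’s actual standard.

```python
# Hypothetical RFM checklist for one component; names and thresholds are
# illustrative assumptions, not PandaDoc's actual standard.
RFM_REQUIREMENTS = {
    "metadata": {
        "owner_defined": True,         # every component needs an owning team
        "docs_linked": True,           # links to sources and documentation are mandatory
    },
    "code_quality": {
        "min_test_coverage": 0.80,     # code test coverage should exceed 80 percent
        "required_rating": "A",        # rating for each quality dimension
    },
    "operations": {
        "alerts_with_runbooks": True,  # monitoring alerts must have runbooks
        "slo_defined": True,           # SLOs and error budgets must be declared
    },
}
```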

At first, we used a simple spreadsheet to track and visualize components' state, and this worked well for a time. However, as more and more microservices were added, we began facing certain issues.

One major problem was dealing with outdated data. Given the manual nature of entering and updating information, engineers would sometimes forget to add or update information in the spreadsheet, leading to missing or inaccurate data.

Another issue was untrustworthy data. Engineers tended to embellish the results of their services by ignoring some of the checks or requirements.

Other problems included verifying that the checks had actually been executed and keeping engineers motivated to maintain the data.

The value of scaling

Scaling our solution was our next step. We started thinking about a solution that could continuously update its data, automatically validate each component’s state, and motivate developers to improve the quality of their microservices. After some diligent searching, we found some potential solutions, but they lacked scoring features and didn’t provide adequate extensibility.

We then decided to create an app we called Backpack, prioritizing the following goals:

  • It’s easy to use and provides engineers with an explicit service state and maturity level
  • It tightens up PandaDoc’s RFM standard, unlike other engineering tools that require additional aggregation, have patchy data, and don’t include evaluation insights
  • It’s made for engineers who want to act proactively, based on the state of their services
  • It unifies everything and presents it in a single view

We treated Backpack as a product with all respective cycles. A dedicated product owner created a list of requirements for its MVP (Minimum Viable Product) version. We designed the app so we could easily extend it using plug-ins.
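As an illustration of that plug-in design, the core of such an architecture can be a small common interface that every integration implements; the class and method names below are hypothetical, not Backpack’s actual code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Outcome of a single gate, e.g. 'coverage' or 'alerts_configured'."""
    name: str
    passed: bool
    details: str = ""


class BackpackPlugin(ABC):
    """Hypothetical base class each integration (SonarQube, Sentry, Snyk, ...) implements."""

    @abstractmethod
    def collect(self, component: str) -> list[CheckResult]:
        """Run this plugin's checks for one component and return the results."""
```

Each plugin then only needs to know how to talk to its own backing system, while the core application aggregates the results into a single view and a score.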

Currently, Backpack has the following plugins:

  • Metadata takes the information from Kubernetes and GitLab and gives us information about the owner, details about the component, the version of the component, the latest revision that was analyzed, links to sources and documentation, and a list of maintainers. We verify that all these metadata attributes are defined.
  • SonarQube offers a quick link to the respective component and an overview of main metrics, including security review, reliability, maintainability, coverage, and tech debt. We check the configuration, the rating for each of the dimensions (which should be A), and the code test coverage (which should exceed 80 percent); see the sketch after this list.
  • Sentry gives us an overview of the issues related to the component. We check the configuration of this tool.
  • Snyk supplies information about security issues within our dependencies. We check configuration and vulnerabilities, plus deprecated, stale, and obligatory dependencies.
  • Prometheus is a monitoring system that provides information about alerts. We check the configuration of alerts and their runbooks.
  • Jaeger is a distributed tracing system we use for monitoring and troubleshooting.
  • Grafana helps us check a component’s configuration and dashboard availability.
  • SLO and Error Budget are internal tools that offer information about service SLOs and error budgets. We check configuration, defined SLOs, and remaining error budgets.
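For example, the SonarQube gate could be implemented roughly as follows. The endpoint and metric keys come from SonarQube’s public measures API, but the exact parameters, thresholds, and error handling here are an illustrative assumption rather than Backpack’s real implementation.

```python
import requests


def sonarqube_gate(base_url: str, token: str, component_key: str) -> bool:
    """Hypothetical gate: every rating should be A and coverage should exceed 80 percent."""
    resp = requests.get(
        f"{base_url}/api/measures/component",
        params={
            "component": component_key,
            "metricKeys": "coverage,security_rating,reliability_rating,sqale_rating",
        },
        auth=(token, ""),  # SonarQube tokens are sent as the username with an empty password
        timeout=10,
    )
    resp.raise_for_status()
    measures = {m["metric"]: m["value"] for m in resp.json()["component"]["measures"]}

    coverage_ok = float(measures.get("coverage", 0)) > 80.0
    # SonarQube returns ratings numerically: 1.0 == A, 2.0 == B, and so on.
    ratings_ok = all(
        float(measures.get(key, 5)) == 1.0
        for key in ("security_rating", "reliability_rating", "sqale_rating")
    )
    return coverage_ok and ratings_ok
```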

All these plugins and checks give us the ability to calculate a score. We can configure different weights for different gates: if we decide something is an immediate top priority, we can increase its weight. We also keep a history of scores over time, so we can track an increase or decrease in any component’s score.
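A minimal sketch of that kind of weighted scoring, with hypothetical gate names and weights:

```python
# Hypothetical gates and weights; the real Backpack gates and weights may differ.
GATE_WEIGHTS = {
    "metadata_complete": 1.0,
    "sonarqube_gate": 3.0,     # a gate we consider top priority gets a higher weight
    "snyk_no_vulnerabilities": 2.0,
    "alerts_with_runbooks": 1.5,
    "slo_defined": 1.5,
}


def component_score(gate_results: dict[str, bool]) -> float:
    """Weighted share of passed gates, normalized to 0-100."""
    total = sum(GATE_WEIGHTS.values())
    earned = sum(weight for gate, weight in GATE_WEIGHTS.items() if gate_results.get(gate))
    return round(100 * earned / total, 1)


# Example: a component passing everything except the Snyk gate.
print(component_score({
    "metadata_complete": True,
    "sonarqube_gate": True,
    "snyk_no_vulnerabilities": False,
    "alerts_with_runbooks": True,
    "slo_defined": True,
}))  # -> 77.8
```

Storing the result of this calculation for every component on each run is what gives us the score history mentioned above.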

[Screenshots: Backpack’s all-projects view, score view, and error-budget view]

Adoption

For any tool to be effective and deliver value, it needs to be adopted by users. We recognized this and decided to implement a gamification system to encourage Backpack’s adoption.

We created a leaderboard that showed the progress of each team and began mentioning the scores of each group in meetings. This approach worked well, creating friendly competition between teams and incentivizing them to prioritize activities related to Backpack items. As teams began to compete for top spots on the leaderboard, adoption of the tool and standards increased. Engineers were more motivated to improve their components and ensure they met the required criteria, as they knew their efforts would be recognized.
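The leaderboard itself can be as simple as averaging Backpack scores per owning team; the team and component names below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical mapping of component -> (owning team, Backpack score).
component_scores = {
    "documents-service": ("team-editor", 92.0),
    "templates-service": ("team-editor", 74.5),
    "billing-service": ("team-payments", 88.0),
}


def leaderboard(scores: dict[str, tuple[str, float]]) -> list[tuple[str, float]]:
    """Average component scores per team, highest first."""
    per_team: dict[str, list[float]] = defaultdict(list)
    for _, (team, score) in scores.items():
        per_team[team].append(score)
    return sorted(
        ((team, sum(vals) / len(vals)) for team, vals in per_team.items()),
        key=lambda item: item[1],
        reverse=True,
    )


for team, avg in leaderboard(component_scores):
    print(f"{team}: {avg:.1f}")
```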

In this way, the gamification system helped to encourage the adoption of the new tool and process — which ultimately led to improved product quality and reliability.

To sum up

Backpack is more than a tool to verify RFM standards. We use it for discoverability, for security, and to check other rules. It has become a place for developers to aggregate information and access all available resources.

The lessons we’ve learned: do things that don’t scale first and solve the problem in front of you. Once you scale, try to automate as much as possible and eliminate subjective judgments. Adoption is very important, and turning it into a game may increase your success rate.

written by
Artsiom Strok
Director of Engineering, AI and Data, at PandaDoc with over 13 years of experience in the field. I've built numerous solutions to automate document-heavy processes using AI and machine learning, as well as the MLOps platform to deploy, manage, and monitor machine learning models at scale. Currently, I'm leading the development of the application platform, where our focus areas include developer experience, product quality, and reliability.
illustrated by
Marharyta Heokchian
edited by
Charles the Panda
photos and pictures by
PandaDoc and public sources
© 2022 by PandaDoc