"Quality management" by Artsiom Strok

How PandaDoc Approaches Quality Management of Over 200 Microservices

Quality is an essential component of product development that can have a significant impact on both product success and user satisfaction. At PandaDoc, our team’s size and complexity have expanded rapidly over the last few years, and in that time we’ve added over 50 microservices. It hasn’t always been easy to manage technical quality while scaling that quickly, but we’ve overcome some big challenges. Here’s how we’ve done it.

Do things that don’t scale

Customers are our top priority at PandaDoc, and our app is a mission-critical product for them. If we deliver substandard quality, we risk damaging our reputation and eroding customer trust. Our Engineering team understands how crucial our product’s quality and reliability are, so we’re always looking for ways to improve.

One of the solutions we’ve come up with has been to introduce the Ready For Maintenance (RFM) standard, which mandates specific and measurable requirements that every component of the system — whether a microservice or a library — needs to meet.
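To make the standard concrete, here is a minimal sketch of what a machine-readable RFM checklist for a single component could look like. The attribute names and thresholds are illustrative assumptions based on the checks described later in this post, not PandaDoc’s actual standard.

```python
# Hypothetical RFM checklist for one component; names and thresholds are
# illustrative assumptions, not PandaDoc's actual standard.
RFM_REQUIREMENTS = {
    "metadata": {
        "owner_defined": True,         # every component needs an owning team
        "docs_linked": True,           # links to sources and documentation are mandatory
    },
    "code_quality": {
        "min_test_coverage": 0.80,     # code test coverage should exceed 80 percent
        "required_rating": "A",        # rating for each quality dimension
    },
    "operations": {
        "alerts_with_runbooks": True,  # monitoring alerts must have runbooks
        "slo_defined": True,           # SLOs and error budgets must be declared
    },
}
```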

At first, we used a simple spreadsheet to track and visualize components' state, and this worked well for a time. However, as more and more microservices were added, we began facing certain issues.

One major problem was dealing with outdated data. Given the manual nature of entering and updating information, engineers would sometimes forget to add or update information in the spreadsheet, leading to missing or inaccurate data.

Another issue was untrustworthy data. Engineers tended to embellish the results of their services by ignoring some of the checks or requirements.

Other problems included verifying that the checks had actually been executed and keeping engineers motivated to maintain the data.

The value of scaling

Scaling our solution was our next step. We started thinking about a solution that could continuously update its data, automatically validate each component’s state, and motivate developers to improve the quality of their microservices. After some diligent searching, we found some potential solutions, but they lacked scoring features and didn’t provide adequate extensibility.

We then decided to create an app we called Backpack, prioritizing the following goals:

  • It’s easy to use and provides engineers with an explicit service state and maturity level
  • It tightens up PandaDoc’s RFM standard, unlike other engineering tools that require additional aggregation, have patchy data, and don’t include evaluation insights
  • It’s made for engineers who want to act proactively, based on the state of their services
  • It unifies everything and presents it in a single view

We treated Backpack as a product with all respective cycles. A dedicated product owner created a list of requirements for its MVP (Minimum Viable Product) version. We designed the app so we could easily extend it using plug-ins.
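As an illustration of that plug-in design, the core of such an architecture can be a small common interface that every integration implements; the class and method names below are hypothetical, not Backpack’s actual code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Outcome of a single gate, e.g. 'coverage' or 'alerts_configured'."""
    name: str
    passed: bool
    details: str = ""


class BackpackPlugin(ABC):
    """Hypothetical base class each integration (SonarQube, Sentry, Snyk, ...) implements."""

    @abstractmethod
    def collect(self, component: str) -> list[CheckResult]:
        """Run this plugin's checks for one component and return the results."""
```

Each plugin then only needs to know how to talk to its own backing system, while the core application aggregates the results into a single view and a score.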

Currently, Backpack has the following plugins:

  • Metadata takes the information from Kubernetes and GitLab and gives us information about the owner, details about the component, the version of the component, the latest revision that was analyzed, links to sources and documentation, and a list of maintainers. We verify that all these metadata attributes are defined.
  • SonarQube offers a quick link to the respective component and an overview of main metrics, including security review, reliability, maintainability, coverage, and tech debt. We check the configuration, the rating for each of the dimensions (which should be A), and the code test coverage (which should exceed 80 percent); see the sketch after this list.
  • Sentry gives us an overview of the issues related to the component. We check the configuration of this tool.
  • Snyk supplies information about security issues within our dependencies. We check configuration and vulnerabilities, plus deprecated, stale, and obligatory dependencies.
  • Prometheus is a monitoring system that provides information about alerts. We check the configuration of alerts and their runbooks.
  • Jaeger is a distributed tracing system we use for monitoring and troubleshooting.
  • Grafana helps us check a component’s configuration and dashboard availability.
  • SLO and Error Budget are internal tools that offer information about service SLOs and error budgets. We check configuration, defined SLOs, and remaining error budgets.
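For example, the SonarQube gate could be implemented roughly as follows. The endpoint and metric keys come from SonarQube’s public measures API, but the exact parameters, thresholds, and error handling here are an illustrative assumption rather than Backpack’s real implementation.

```python
import requests


def sonarqube_gate(base_url: str, token: str, component_key: str) -> bool:
    """Hypothetical gate: every rating should be A and coverage should exceed 80 percent."""
    resp = requests.get(
        f"{base_url}/api/measures/component",
        params={
            "component": component_key,
            "metricKeys": "coverage,security_rating,reliability_rating,sqale_rating",
        },
        auth=(token, ""),  # SonarQube tokens are sent as the username with an empty password
        timeout=10,
    )
    resp.raise_for_status()
    measures = {m["metric"]: m["value"] for m in resp.json()["component"]["measures"]}

    coverage_ok = float(measures.get("coverage", 0)) > 80.0
    # SonarQube returns ratings numerically: 1.0 == A, 2.0 == B, and so on.
    ratings_ok = all(
        float(measures.get(key, 5)) == 1.0
        for key in ("security_rating", "reliability_rating", "sqale_rating")
    )
    return coverage_ok and ratings_ok
```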

All these plugins and checks give us the ability to calculate a score. We can configure different weights for different gates: if we decide something is an immediate top priority, we can increase its weight. We also keep a history of scores over time, so we can track an increase or decrease in any component’s score.
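A minimal sketch of that kind of weighted scoring, with hypothetical gate names and weights:

```python
# Hypothetical gates and weights; the real Backpack gates and weights may differ.
GATE_WEIGHTS = {
    "metadata_complete": 1.0,
    "sonarqube_gate": 3.0,     # a gate we consider top priority gets a higher weight
    "snyk_no_vulnerabilities": 2.0,
    "alerts_with_runbooks": 1.5,
    "slo_defined": 1.5,
}


def component_score(gate_results: dict[str, bool]) -> float:
    """Weighted share of passed gates, normalized to 0-100."""
    total = sum(GATE_WEIGHTS.values())
    earned = sum(weight for gate, weight in GATE_WEIGHTS.items() if gate_results.get(gate))
    return round(100 * earned / total, 1)


# Example: a component passing everything except the Snyk gate.
print(component_score({
    "metadata_complete": True,
    "sonarqube_gate": True,
    "snyk_no_vulnerabilities": False,
    "alerts_with_runbooks": True,
    "slo_defined": True,
}))  # -> 77.8
```

Storing the result of this calculation for every component on each run is what gives us the score history mentioned above.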

[Screenshots: Backpack’s all-projects view, score view, and error-budget view]

Adoption

For any tool to be effective and deliver value, it needs to be adopted by users. We recognized this and decided to implement a gamification system to encourage Backpack’s adoption.

We created a leaderboard that showed the progress of each team and began mentioning the scores of each group in meetings. This approach worked well, creating friendly competition between teams and incentivizing them to prioritize activities related to Backpack items. As teams began to compete for top spots on the leaderboard, adoption of the tool and standards increased. Engineers were more motivated to improve their components and ensure they met the required criteria, as they knew their efforts would be recognized.
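The leaderboard itself can be as simple as averaging Backpack scores per owning team; the team and component names below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical mapping of component -> (owning team, Backpack score).
component_scores = {
    "documents-service": ("team-editor", 92.0),
    "templates-service": ("team-editor", 74.5),
    "billing-service": ("team-payments", 88.0),
}


def leaderboard(scores: dict[str, tuple[str, float]]) -> list[tuple[str, float]]:
    """Average component scores per team, highest first."""
    per_team: dict[str, list[float]] = defaultdict(list)
    for _, (team, score) in scores.items():
        per_team[team].append(score)
    return sorted(
        ((team, sum(vals) / len(vals)) for team, vals in per_team.items()),
        key=lambda item: item[1],
        reverse=True,
    )


for team, avg in leaderboard(component_scores):
    print(f"{team}: {avg:.1f}")
```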

In this way, the gamification system helped to encourage the adoption of the new tool and process — which ultimately led to improved product quality and reliability.

To sum up

Backpack is more than a tool to verify RFM standards. We use it for discoverability, for security, and to check other rules. It has become a place for developers to aggregate information and access all available resources.

The lessons we’ve learned: do things that don’t scale first and solve the problem in front of you. Once you scale, try to automate as much as possible and eliminate subjective judgments. Adoption is very important, and turning it into a game may increase your success rate.

written by
Artsiom Strok
Director of Engineering, AI and Data, at PandaDoc with over 13 years of experience in the field. I've built numerous solutions to automate document-heavy processes using AI and machine learning, as well as the MLOps platform to deploy, manage, and monitor machine learning models at scale. Currently, I'm leading the development of the application platform, where our focus areas include developer experience, product quality, and reliability.
illustrated by
Marharyta Heokchian
edited by
Charles the Panda
photos and pictures by
PandaDoc and public sources
© 2022 by PandaDoc