Quality is an essential component of product development that can have a significant impact on both product success and user satisfaction. At PandaDoc, we’ve seen our team’s size and complexity expand rapidly in the last few years, and in that time we’ve added over 50 microservices. It hasn’t always been easy to manage technical quality while growing this quickly, but we’ve overcome some big challenges. Here’s how we’ve done it.
Customers are our top priority at PandaDoc, and our app is a mission-critical product for them. If we deliver substandard quality, we risk damaging our reputation and eroding customer trust. Our Engineering team understands how our product’s quality and reliability are crucial, so we’re always looking for ways to improve.
One of the solutions we’ve come up with has been to introduce the Ready For Maintenance (RFM) standard, which mandates specific and measurable requirements that every component of the system — whether a microservice or a library — needs to meet.
At first, we used a simple spreadsheet to track and visualize components' state, and this worked well for a time. However, as more and more microservices were added, we began facing certain issues.
One major problem was dealing with outdated data. Given the manual nature of entering and updating information, engineers would sometimes forget to add or update information in the spreadsheet, leading to missing or inaccurate data.
Another issue was untrustworthy data. Engineers tended to embellish the results of their services by ignoring some of the checks or requirements.
Other problems included verifying that the required checks had actually been carried out and motivating engineers to keep the data up to date.
Scaling our solution was our next step. We started thinking about a solution that could continuously update, automatically validate the state, and motivate developers to improve the quality of their microservice. After some diligent searching, we found some potential solutions, but they lacked scoring features and didn’t provide adequate extensibility.
We then decided to create an app we called Backpack, prioritizing the following goals:
We treated Backpack as a product with a full product development cycle. A dedicated product owner created a list of requirements for its MVP (Minimum Viable Product) version. We designed the app so we could easily extend it using plugins.
Currently, Backpack has the following plugins:
All these plugins and checks give us the ability to calculate a score. We can configure different weights for different gates. If something becomes an immediate top priority, we can raise the weight of its gate so that it counts for more in the score. We keep a history of scores over time, so we can track an increase or decrease in any component’s score.
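To make the weighting idea concrete, here is a minimal sketch of how a weighted score could be computed from gate results. It is not Backpack's actual implementation: the `GateResult` type, the gate names, the 0 to 100 scale, and the helper functions are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    """Outcome of one check (gate) for a component. Hypothetical type."""
    name: str      # illustrative gate name, not a real Backpack gate
    passed: bool   # whether the component met the requirement
    weight: float  # configurable importance of this gate


def component_score(results):
    """Return the weighted share of passed gates on an assumed 0-100 scale."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0  # no weighted gates configured, so nothing to score
    earned = sum(r.weight for r in results if r.passed)
    return 100.0 * earned / total_weight


def score_delta(history):
    """Return the change between the two most recent scores in a history list."""
    if len(history) < 2:
        return 0.0
    return history[-1] - history[-2]


# Example: a failing gate with a higher weight pulls the score down more.
results = [
    GateResult("has_readme", passed=True, weight=1.0),
    GateResult("has_alerts", passed=False, weight=3.0),
]
print(component_score(results))
```

In this sketch, raising the weight of a gate that many components fail lowers their scores immediately, which is one way a newly urgent requirement could be made visible. Storing each computed score in a list or table is enough to support a simple trend check such as `score_delta`.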
For any tool to be effective and deliver value, it needs to be adopted by users. We recognized this and decided to implement a gamification system to encourage Backpack’s adoption.
We created a leaderboard that showed the progress of each team and began mentioning the scores of each group in meetings. This approach worked well, creating friendly competition between teams and incentivizing them to prioritize activities related to Backpack items. As teams began to compete for top spots on the leaderboard, adoption of the tool and standards increased. Engineers were more motivated to improve their components and ensure they met the required criteria, as they knew their efforts would be recognized.
In this way, the gamification system helped to encourage the adoption of the new tool and process — which ultimately led to improved product quality and reliability.
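As an illustration of the leaderboard idea, the following sketch ranks teams by the average score of the components they own. Averaging, the tuple layout, and the team and component names are assumptions for this example. The article does not say how Backpack aggregates scores per team.

```python
from collections import defaultdict


def team_leaderboard(component_scores):
    """Rank teams by the mean score of their components, highest first.

    component_scores is an iterable of (team, component, score) tuples.
    Using the mean is an assumption; a sum or median would also work.
    """
    scores_by_team = defaultdict(list)
    for team, _component, score in component_scores:
        scores_by_team[team].append(score)

    averages = {
        team: sum(scores) / len(scores)
        for team, scores in scores_by_team.items()
    }
    # Sort by descending average, then by team name to keep ties stable.
    return sorted(averages.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical data: two teams owning three components in total.
data = [
    ("alpha", "svc-a", 80.0),
    ("alpha", "svc-b", 60.0),
    ("beta", "svc-c", 90.0),
]
for rank, (team, average) in enumerate(team_leaderboard(data), start=1):
    print(rank, team, average)
```

Ranking by average rather than by total avoids rewarding a team simply for owning more components, though either choice is defensible.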
Backpack is more than a tool to verify RFM standards. We use it for discoverability, security, and to check other rules. It has become a place for developers to aggregate information and access all available resources.
The lessons we’ve learned are these: do things that don’t scale first, and solve the current problem. Once you scale, try to automate as much as possible and eliminate subjective judgments. Adoption is very important, and turning it into a game may increase your success rate.
It can be hard to imagine any team, let alone an engineering team, operating without team leads, but it can be done. By moving away from traditional leadership hierarchies and towards a more collaborative environment, teams can uncover their full potential and develop innovative solutions.
PandaDoc serves more than 50,000 customers each day, and we bear a huge responsibility to provide every one of our users with reliable service. Our application consists of more than 100 services, plus dozens of front-ends developed throughout the last ten years. When working with a system this complex, an engineer can't track every component manually — this is simply beyond any human's limit.
This article will give you a glimpse into how PandaDoc manages reliability and how we improve it continuously.
In a large-scale distributed application, infrastructure is hosted in the cloud. Multiple programming languages are involved, along with databases, message brokers, cache servers, and other open-source or third-party vendor systems, and many services interact in complex chains of logic. Minor disruptions in one service can be magnified and have serious side effects on the whole platform.