Building Reliability — Dealing with Stability
Introduction
Reliability means “the quality of being trustworthy” or “performing consistently well.” You can achieve reliability in an application and its platform by building stability and by increasing availability; scaling vertically or horizontally provides the required resources in a timely manner. Once you cross functional boundaries and enter the non-functional zone, you start seeing reliability problems: scale, stability, and capacity issues are hard to reproduce, tough to pin down, and take constant effort to fix. In this three-blog series, I will cover each of these problems with a case study. In this blog, I share my experience dealing with a 20+-year-old monolith and its test infrastructure: how we overcame stability issues and built reliability. I will walk through our approach and the best practices we followed, which resulted in the overall stability of hundreds of production instances as well as a reliable performance test infrastructure.
Problem
When I joined the Small Business Group a couple of years back, I found it strange that we restarted our production system (a 20+-year-old monolith) every night. Restarting the JVM (Java Virtual Machine) had become a standard procedure: everyone knew there was a problem, but the band-aid of rebooting stayed in place and the root cause was never fixed. The Ops team maintained a cron schedule that executed most of the restarts via script. The problem had been dormant for years, and Ops treated it as a routine job.
The main issues were on the JVM side: garbage collection for the application running on Tomcat was hitting its limits, major garbage collections were frequent, and pause times were long. This left the application unresponsive or with very high latency, and by the end of the day it resulted in OOM (Out Of Memory) errors. Unfortunately, we didn’t have a way to reproduce this in a performance environment. We had to solve both “restart the production system every day” and “reproduce the problem in pre-production.”
Analysis
Around this time, we decided to tackle the hairy problem of performance across all products and formed a “Performance Tiger Team.” Alongside another performance workstream, this stability effort was one of the important initiatives the team kicked off.
We looked into our existing performance suite and saw room for improvement, but enhancing that legacy suite was not wise. The monolith was moving to a new V4 API, so we took the opportunity to build Gatling-based performance tests against V4. This V4 suite worked alongside and complemented the existing automation for the older API versions (V3 and V1).
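To give a feel for what such a suite looks like, here is a minimal Gatling simulation sketch. The host name, endpoint path, and load shape are illustrative placeholders, not our actual V4 tests, which lived inside our own framework wrapper.

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class V4ReadSimulation extends Simulation {

  // Hypothetical perf-environment host; the real suite used our framework wrapper.
  val httpProtocol = http
    .baseUrl("https://perf-env.example.com/api/v4")
    .acceptHeader("application/json")

  val readItems = scenario("V4 read flow")
    .exec(
      http("list items")
        .get("/items")
        .check(status.is(200))
    )

  // Illustrative load shape only: ramp to 100 users over 5 minutes.
  setUp(
    readItems.inject(rampUsers(100).during(5.minutes))
  ).protocols(httpProtocol)
}
```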
We were using an old garbage collection algorithm, and moving to the latest one would require JVM tuning. We had to do memory and heap dump analysis to find opportunities on both the code and the JVM side. So we took heap dumps and executed performance tests with new JVM settings. We started taking dumps on various production clusters and comparing them with the perf environment, and found gaps between how customers were using the application and how we were testing it. Looking further at the existing analysis of multiple incidents confirmed both the need for JVM tuning and the test coverage gap.
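For anyone reproducing this kind of analysis, heap dumps can be captured with standard JDK tools such as jmap or jcmd, or programmatically through the HotSpot diagnostic MXBean. The sketch below shows the programmatic route; the output path is just an example, not our actual tooling.

```scala
import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

object HeapDumper {

  // Writes an .hprof heap dump of the current JVM to the given path.
  // live = true restricts the dump to reachable objects only.
  def dump(outputPath: String, live: Boolean = true): Unit = {
    val mxBean = ManagementFactory.newPlatformMXBeanProxy(
      ManagementFactory.getPlatformMBeanServer,
      "com.sun.management:type=HotSpotDiagnostic",
      classOf[HotSpotDiagnosticMXBean]
    )
    mxBean.dumpHeap(outputPath, live)
  }
}

// Example usage (hypothetical path):
// HeapDumper.dump("/tmp/app-heap.hprof")
```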
When we looked into our test data, we found it was nowhere close to production. It had last been built a couple of years earlier; since then our growth had doubled and customer usage patterns had completely changed. We needed to add end-customer flows for each API based on those usage patterns, and we needed fresh, production-like data for our tests to go along with them.
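In Gatling terms, production-like records are typically driven into scenarios through feeders. The sketch below assumes a hypothetical CSV export (prod_like_accounts.csv) from a masked dataset; it is an illustration, not our actual test code.

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._

class ProductionLikeDataSimulation extends Simulation {

  // Hypothetical perf-environment host.
  val httpProtocol = http.baseUrl("https://perf-env.example.com")

  // Hypothetical CSV exported from the masked, production-like dataset,
  // with columns such as accountId and itemId on each row.
  val accounts = csv("prod_like_accounts.csv").circular

  val readOwnItems = scenario("Read with production-like accounts")
    .feed(accounts)
    .exec(
      http("read item")
        .get("/api/v4/accounts/${accountId}/items/${itemId}")
        .check(status.is(200))
    )

  setUp(readOwnItems.inject(atOnceUsers(50))).protocols(httpProtocol)
}
```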
Last, we found that we didn’t have a performance CI setup that could give us quick results tied to the latest code commit. Our perf test setup and execution were tedious and required a lot of hands and coordination. On top of that, our results were not very consistent, which eroded trust in the performance tests; there were times we had to execute the same test multiple times just to prove an issue was legitimate.
All of this pointed to application stability, test reliability, and data inconsistency problems. We had to change the application’s JVM settings and work on the tests and their data with automation, so we could get consistent, fast, and reliable regression results.
Solution
We first worked on our test stability, test data, and coverage. Helpfully, the engineers who were porting to the new API started contributing Gatling scripts, which we had wrapped in our framework. With those contributions, we had a robust test suite whose coverage and functionality grew along with the development work.
We worked in close partnership with the DBAs and converted one of the production datasets for production-like testing, with full compliance sign-off from the security team. We ported this data for mocking and built data affinity between our tests and the mock users. Because the data was consistent for every execution, we started getting more reliable regression baseline results. We attached this data to one of the performance CI environments, where it was mapped to each API, and we reused it there through snapshot and restore scripts (see the reference section).
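The actual snapshot and restore scripts are linked in the reference section. Purely to illustrate the idea, here is a rough sketch using the AWS SDK for Java (v1) from Scala; the instance and snapshot identifiers are hypothetical.

```scala
import com.amazonaws.services.rds.AmazonRDSClientBuilder
import com.amazonaws.services.rds.model.{CreateDBSnapshotRequest, RestoreDBInstanceFromDBSnapshotRequest}

object PerfDataSnapshots {

  private val rds = AmazonRDSClientBuilder.defaultClient()

  // Snapshot the masked, production-like RDS instance (identifiers are hypothetical).
  def snapshotBaseline(): Unit =
    rds.createDBSnapshot(
      new CreateDBSnapshotRequest()
        .withDBInstanceIdentifier("perf-baseline-db")
        .withDBSnapshotIdentifier("perf-baseline-snapshot")
    )

  // Restore a fresh instance from that snapshot before each performance run,
  // so every execution starts from the same consistent dataset.
  def restoreForRun(runId: String): Unit =
    rds.restoreDBInstanceFromDBSnapshot(
      new RestoreDBInstanceFromDBSnapshotRequest()
        .withDBInstanceIdentifier(s"perf-run-$runId")
        .withDBSnapshotIdentifier("perf-baseline-snapshot")
    )
}
```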
We added tests for each API in a CRUD (Create, Read, Update, and Delete) flow, weighted by customer usage patterns (for example: Create 20%, Read 55%, Update 20%, Delete 5%). As our test suite and execution matured, we started JVM optimization with various combinations of settings based on our memory dump findings. We switched to the G1 collector (-XX:+UseG1GC) and added -XX:+UseStringDeduplication as one of the parameters (a separate blog covers all the insights).
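For illustration, Gatling’s randomSwitch can express exactly this kind of weighted CRUD mix. The request chains below are placeholders for the real flows, which carried production-like payloads and data affinity.

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class V4CrudMixSimulation extends Simulation {

  // Hypothetical perf-environment host.
  val httpProtocol = http.baseUrl("https://perf-env.example.com/api/v4")

  // Placeholder request chains for each CRUD operation.
  val createFlow = exec(http("create item").post("/items").body(StringBody("""{"name":"perf-item"}""")).asJson)
  val readFlow   = exec(http("read item").get("/items/1"))
  val updateFlow = exec(http("update item").put("/items/1").body(StringBody("""{"name":"updated"}""")).asJson)
  val deleteFlow = exec(http("delete item").delete("/items/1"))

  // Weight each branch according to observed customer usage.
  val crudMix = scenario("V4 CRUD mix")
    .randomSwitch(
      20.0 -> createFlow,
      55.0 -> readFlow,
      20.0 -> updateFlow,
      5.0  -> deleteFlow
    )

  setUp(crudMix.inject(constantUsersPerSec(20).during(10.minutes))).protocols(httpProtocol)
}
```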
We used a Spinnaker-based deployment system, which pushed the code image to the targeted performance environment hosted in AWS. Once changes were verified there, we rolled them out in a phased manner to the production cluster. Below were our Performance CI solutions:
- Production dump analysis and implementation of the new JVM settings.
- Production-like data solution on an AWS-based RDS system for tests and baseline results (see the reference for DB snapshot and restore).
- Test enhancements for API coverage, integrated with the production-like data.
- Performance regression runs before production rollout, using Performance CI (see the gating sketch below).
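As an illustration of that gating step, Gatling assertions can fail the CI job when a run regresses. The thresholds and endpoint below are hypothetical, not our actual gates.

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class V4RegressionGateSimulation extends Simulation {

  // Hypothetical perf-environment host.
  val httpProtocol = http.baseUrl("https://perf-env.example.com/api/v4")

  val smoke = scenario("V4 regression smoke")
    .exec(http("list items").get("/items").check(status.is(200)))

  // Thresholds are illustrative; when an assertion fails, Gatling exits non-zero,
  // which lets the CI job stop the rollout before production.
  setUp(smoke.inject(constantUsersPerSec(10).during(5.minutes)))
    .protocols(httpProtocol)
    .assertions(
      global.responseTime.mean.lt(500),
      global.failedRequests.percent.lte(1)
    )
}
```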
Reward
Today, many engineers are not even aware that we used to have a daily production restart. After these changes, JVM restarts in production disappeared completely. We reduced major garbage collections by 95% and pause time by 85%, which meant no more daily restarts across hundreds of production instances. We are now moving this monolith to Kubernetes and will share more insights in the future.

A few of these performance CI learnings I shared at Dynatrace 2018 as “Shifting Left.” Though our main focus was stability, which JVM tuning improved, we got additional benefits in performance improvement and Performance CI. Building stability into the test data also helped us handle customer performance issues later; I covered that work in depth at Splunk 2018 in the “Addressing Customer Issue” talk.

This work built performance infrastructure that uses production-like data for performance tests in a fully automated way, and it is still in use many years later. More importantly, it brought the required trust in the performance tests: once they became stable and reliable, engineers took the issues they raised more seriously. We found more performance issues, and the tests became our gating mechanism for production, widely known as the “Shifting Left” approach. A few of our recommendations became standard and other projects started leveraging them. Developers started running the tests on their local setups to catch breaking changes. Last but not least, we had fewer production incidents.
- Zero restarts in production due to JVM memory.
- The Ops team became more productive.
- Stable tests with more visibility into outlier APIs or code changes.
- Production-like data for performance testing.
- Faster and more reliable production releases.
Now our production instances are stable 24x7, 365 days a year.
Team and Support
These are the engineers who worked with me on this: Phani, Ravi, Prashant, and Kiran. Special thanks to Mark and Siddharth for reviewing this blog and providing valuable feedback.