Distributed Load Testing Using Argo in Kubernetes (Distro)
Introduction
Distributed load testing is a vast topic; this blog is about executing distributed load tests on Kubernetes using open source tooling. When application scalability, stability, and availability have a high bar, we validate that bar by scaling and distributing the load test infrastructure. Like other companies, we have performance testing best practices and tools to support this infrastructure. We used to rely on a few in-house solutions, with Jenkins serving as one of the distributed testing solutions.
Problem
We were going through a journey from our data center to the cloud (AWS), and most of our performance testing infrastructure was hosted on Jenkins in the data center. We had large load clients attached to that Jenkins instance, which drove the load through Jenkins jobs. We lost that setup during the migration, which gave us the opportunity to find a better solution. The teams could not wait, as they needed to ship product features in a reliable, performant manner. At the same time, our developer platform group started building the Kubernetes platform for Intuit. We had to build this platform for next-generation scale so that TurboTax and QuickBooks could onboard and grow. How would we test that scale? We were looking for a distributed, scalable test infrastructure.
Approach
During that time, we evaluated internal and external tools. We had a few in-house solutions that supported JMeter-based performance testing, a few solutions where teams created load directly on AWS, and a FrontLine (Gatling) setup for a handful of project teams. We looked first at performance testing as code and picked Gatling, as we had already seen its benefits. We also had to solve for latency, so we wanted the test infrastructure close to the application and the cluster. Since our applications would scale as containers on Kubernetes, we wanted a container-based solution that runs on Kubernetes. We did FMH (Follow Me Home) sessions and talked to the engineers who had been executing these performance tests. We identified specific gaps in the existing solutions:
- Self-service
- Cost (vendor vs. open source)
- Scalability
- Container native
- Execution challenges
Options
We ran POCs with a few of these solutions, and during one of our engineering days last year we explored Argo, test containers, and Gatling. We had that solution working and used it for almost a year, but it was very specific to our platform needs. Last week, during engineering days, we made the solution generic and shared it. Now anyone can use it, as long as they run Kubernetes with Argo. The name Distro came up during this time: Distributed testing using Argo. Distro is a container-native solution, built on Kubernetes, that supports many technologies beyond Gatling. Over the last year many customers have used it and provided valuable feedback. We are also our own customers: we have to develop and certify this scalable Kubernetes platform, and we run all of our reliability testing through this solution in a centralized manner. This blog is about that journey, in the hope that it helps others who want to adopt it.
Distro
Everything is code; that was the thought process from day one. Most tools solve parts of the problem, but not everything is expressed as code. Distro is a solution where every part is code.
- Self-service: make everything available as a paved road that teams can adapt and use per their needs.
- Cost: keep it optimal by using smaller pods for load distribution and leveraging Argo workflow cost best practices.
- Scalability: the infrastructure runs in a Kubernetes namespace where we can create instance groups, choose spot vs. fixed instance types, change the replica min/max, implement HPA (Horizontal Pod Autoscaler), and build pod reliability with PDBs (Pod Disruption Budgets).
- Technology agnostic: build container images for various performance programming stacks, not only Gatling but also Node, Python, or Go based container images.
- Reporting: there were specific requirements around reporting and live dashboards, as well as engineer adoption and onboarding, so we use Argo's existing capabilities and ship report logs to a centralized log processor.
Test Container
There is a lot going on around test containers; there is a working group and many useful resources available. Distro builds on the test-container concept, but is oriented toward distributed testing for load, performance, and stress.
Building the Test
You have to know how to write performance tests; throughput, steady state, ramp-up time, and so on are core concepts in any performance programming. We had to make sure our tests use those core performance features so we could build on them when we wanted to scale. So we created a reference template for a Gatling-based performance test, where all the performance parameters are exposed as command-line parameters, and we build the test suite with mvn. The attached code is a reference that kickstarts against a health check service endpoint: a test that validates the health endpoint and passes if the service is healthy. Exposing the endpoint to the test via a command-line option is the key, so that Distro can hit the endpoint from the same cluster or from an external one. Example:
mvn clean install -Dgatling.simulationClass=Echo.EchoSimulation -DpeakTPS=1 -DrampupTime=1 -DsteadyStateTime=1 -Durl=http://localhost:8080 -Dquery=/health/full
Building the Test Container
We work on Kubernetes and understand the importance of containers, so we wrapped this test code in a container and exposed the parameters above. We used Docker to produce the container image; check the Dockerfile in the attached code for reference. Since the test code evolves along with the application code and both live in the same repo, we produce the test container image as part of the pipeline. Later we added many other container images, both functional and non-functional.
Argo Workflow
Initially, we were already working with Argo Workflows, and during those interactions we found that combining the container logic with an Argo workflow could solve the scale problem. Most of our requirements matched Argo features, but we had to make everything work end to end. Argo provides step functions with looping capability, which helps us grow and scale. Argo builds on core Kubernetes principles, so leveraging this infrastructure and scaling on Kubernetes works well. On top of that, Argo provides cost optimization best practices and resource optimization during and after the tests. Please refer to the script or Jenkinsfile for more details. Example:
argo submit gatling/perf-infra-wf.yaml -plimit=1 -ppeakTPS=1 -prampupTime=1 -psteadyStateTime=1 -pbaseurl=http://localhost:8080 -pquery=/health/full
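Below is a minimal, illustrative sketch of what such a workflow can look like; it is not our exact perf-infra-wf.yaml. The parameter names match the submit command above, while the template names, arguments, and step layout are assumptions for illustration:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: perf-infra-
spec:
  entrypoint: distro
  arguments:
    parameters:
      - name: limit                # number of load-generator pods to fan out
        value: "1"
      - name: peakTPS
        value: "1"
      - name: rampupTime
        value: "1"
      - name: steadyStateTime
        value: "1"
      - name: baseurl
        value: "http://localhost:8080"
      - name: query
        value: "/health/full"
  templates:
    - name: distro
      steps:
        - - name: load                           # loop: one test pod per iteration
            template: run-test
            withSequence:
              count: "{{workflow.parameters.limit}}"
        - - name: aggregate                      # starts only after every load pod finishes
            template: merge-reports
    - name: run-test
      container:
        image: distroproj/gatling:latest
        args:
          - "-DpeakTPS={{workflow.parameters.peakTPS}}"
          - "-DrampupTime={{workflow.parameters.rampupTime}}"
          - "-DsteadyStateTime={{workflow.parameters.steadyStateTime}}"
          - "-Durl={{workflow.parameters.baseurl}}"
          - "-Dquery={{workflow.parameters.query}}"
    - name: merge-reports
      container:
        image: distroproj/gatling-merge:latest
The loop is the scaling lever: each iteration schedules one load pod, and the aggregation step runs only once all of them complete.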
Bring Your Own Test Container
We created a reference container image to kickstart adoption; teams that want to opt in modify their test code and adjust the Argo spec file. We have generalized the Argo spec file, and many teams have started building on top of it. Whoever wants to onboard can begin with the base state and build incremental functionality for their application; the Argo spec picks up their container and executes it via command-line parameters. At this point we hit the report aggregation issue of combining many reports into a single one, which we solved by creating another image that merges the results from the various pods; that merge was later added as a step in the Argo workflow. Example images:
- name: appTestImg
value: "distroproj/gatling:latest"
- name: awscliGatmergeImg
value: "distroproj/gatling-merge:latest"
Load Distribution
Vertical vs. Horizontal Scaling
We set up the Distro infra in a namespace, aka a workload. We assigned a machine type and could scale that machine from large to 4xlarge, so vertical scaling is handled by choosing a bigger machine type. We allocated 1 GB of memory and 1 CPU per pod for testing on the node, tuned based on throughput, and established a benchmark for how much load a pod can generate within its CPU/memory/throughput budget. We were able to generate a 10K TPS test with only six pods. As we onboarded more teams and services, the need for horizontal scale came up; we handle it via the pod replica count on the workload/namespace, controlled by the HPA after configuring its CPU/memory parameters.
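A rough sketch of that sizing and scaling, assuming the load generators sit behind a Deployment in the infra namespace; the resource values come from the benchmark above, while the HPA name, target, and thresholds are illustrative assumptions:
- name: run-test
  container:
    image: distroproj/gatling:latest
    resources:
      requests:
        cpu: "1"               # benchmark sizing: 1 CPU per load pod
        memory: 1Gi            # and 1 GB of memory per load pod
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: distro-workload-hpa        # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: distro-workload          # illustrative target in the infra namespace
  minReplicas: 1
  maxReplicas: 200
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70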
How to use
So we created a perf infra file using a workflow, added container images for the test and for aggregation, and provided parameters for performance and load. We started executing it via argo commands, which we later converted into a Jenkins pipeline. We found that the TPS calculation is a little different from regular performance testing because we distribute the test load across pods: in the standard case a single load generator takes the full load, while in our case the load is spread across pods. On top of that, we created the test scenario based on business logic, where we give the read APIs five times the load of the CUD (Create, Update, and Delete) APIs. So the calculation below became our standard:
Overall TPS = (Read API + CUD API) TPS per pod × (number of pods)
For example, with the 5:1 split, a pod driving 5 TPS of read traffic and 1 TPS of CUD traffic contributes 6 TPS, so six such pods generate roughly 36 TPS overall.
Self Service
We give engineers everything as code so they can use it and extend it further. We provide the Distro infra as well as a reference test container for a health check endpoint, since that is the minimum you need: any application deployed on Kubernetes has a readiness probe mapped to the health endpoint of its deployment. We created a declarative pipeline, delivered via a Jenkins shared library, so teams can execute it smoothly without copying pipeline code, and a way for developers to upload the execution data to S3 or a data lake for historical trends. By providing a few parameters, you can increase the load and distribute it among many pods. This has been shared with the Argo community.
Challenges
Scaling problems
Scaling problems are hard to reproduce and take time to fix, and most of the problems we found and solved map directly to application-scale problems. The first problem was how to ensure the test pods are not impacted and stay resilient during a test; hence we introduced PDBs (Pod Disruption Budgets). Second, we were using spot instances to save cost, but spot instances are often reclaimed by AWS without much warning, which invalidated our tests. Hence we added a no-spot option so that we can always run on fixed (on-demand) hosts. Third, as we were distributing the load across AZs, we found that our cloud provider performs AZ rebalancing, which interrupts the pods. So we opted out of AZ rebalancing, ensuring that nodes in a given AZ are not cleaned up until the load test finishes.
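For the first fix, a minimal PodDisruptionBudget sketch for the load-generator pods might look like this (the name and label selector are illustrative assumptions); the no-spot option then comes down to scheduling these pods onto on-demand nodes:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: distro-load-pdb            # illustrative name
spec:
  minAvailable: "100%"             # never voluntarily evict load pods mid-test
  selector:
    matchLabels:
      app: distro-load-generator   # illustrative label on the load pods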
Report Aggregation
When we started executing the load across many pods, the first problem was that the tests might not finish at the same time. Second, even when the tests finish at different times, how do we aggregate them into a consolidated report? Gatling, like many other tools, provides a way to aggregate many executions into a single report, so we built on that logic, created another container image for merging, and used it to merge the execution results. Finally, in the Argo workflow we added a step that waits for all tests to finish and then runs the report aggregation.
Report Storage
Once we started getting aggregated reports, we realized we could not keep the reports on the pods, and we also needed a place to perform the aggregation. We leverage Argo's S3 feature, which supports S3 integration out of the box from the namespace. We push all the report data to S3; during aggregation we bring it back, create the consolidated report, and upload it to S3 again. Before settling on S3 we explored other options, but none of them worked at scale.
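Argo's built-in artifact support is what makes this workable: each load step can declare its Gatling results directory as an output artifact keyed per pod, and the merge step pulls those artifacts back before producing the consolidated report. A hedged sketch, where the bucket, path, and secret names are assumptions:
- name: run-test
  container:
    image: distroproj/gatling:latest
  outputs:
    artifacts:
      - name: gatling-results
        path: /opt/gatling/results            # assumed results directory in the test image
        s3:
          endpoint: s3.amazonaws.com
          bucket: distro-perf-reports         # illustrative bucket
          key: "{{workflow.name}}/{{pod.name}}/results.tgz"
          accessKeySecret:
            name: s3-credentials              # illustrative secret
            key: accessKey
          secretKeySecret:
            name: s3-credentials
            key: secretKey
Argo can also read these settings from an artifact repository configured at the namespace level, which matches the out-of-the-box S3 integration mentioned above.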
Report Accessibility
We started storing the results in S3, but the bucket is not public, and we don't want to make it public either. This workload also has special access requirements in the S3 bucket policy, so we created an onboarding process and started adding workloads to the bucket policy. We also added a read-only user in Jenkins that can fetch the S3 Gatling report. This was when we debated keeping the reporting server centralized vs. distributed; a centralized server would need someone to manage it, so we picked the distributed option in Distro. We also simplified this process across clusters and AWS accounts via similar AWS read-only user access; check this out to know more. Now you can have as many Kubernetes clusters as you like and submit various distributed loads, and with Argo's artifact storage you can always download the results via the Argo UI for single or multiple executions.
Integration
Pipeline (Jenkins)
We created a Jenkins declarative function that submits the Argo workflow to the infra namespace via kubecontext. This is how we invoke Argo from the Jenkins pipeline, and most of our executions happen through the pipeline.
Chaos (Litmus)
We have integrated the performance test with chaos testing (LitmusChaos) to ensure we have full visibility during chaotic operations (check out this blog for more information). These executions also run via the Jenkins pipeline.
SLI/SLO (Keptn)
We have integrated with Keptn, using one of its SLI providers to fetch server-side metrics. We defined SLIs (error rate, response time, and throughput) and configured SLOs (error rate 0.01, response time 50 ms, and throughput 5000 transactions). Based on our test types (small, medium, and large), we adjust the SLOs for each execution. This is still a work in progress but is showing positive results.
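As a sketch, an SLO file in Keptn's quality-gate format with those thresholds might look roughly like the following; the SLI names depend on the configured SLI provider, so treat them as placeholders:
spec_version: "1.0"
comparison:
  compare_with: "single_result"
objectives:
  - sli: error_rate              # error SLI, pass if at most 0.01
    pass:
      - criteria:
          - "<=0.01"
  - sli: response_time_p95       # response time SLI, in milliseconds
    pass:
      - criteria:
          - "<=50"
  - sli: throughput              # throughput SLI, transactions
    pass:
      - criteria:
          - ">=5000"
total_score:
  pass: "90%"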
Cost
Actual Cost
After running many scalability tests at high throughput, we found that we are within budget. Our infrastructure, which runs 10K+ TPS tests 10 to 20 times a month, costs less than $500; that includes only the load-testing resources, not the cluster cost or the Kubernetes resources of the services under test, and it already supports many services and infrastructures. Once a test is done, the HPA and replica configuration shrink the setup down to a minimum number of pods, as few as one. With any commercial distributed testing tool, you pay $$$$ for a license and then either build the infrastructure yourself or use theirs, which also incurs cost. When we compared one of the commercial tools with our setup, our operating cost was far lower and our licensing cost was zero.
Best practices
With recent Argo enhancements around cost, usability, and debuggability, this entire infrastructure came to life. We added PDBs to keep execution reliable and adopted Kubernetes scaling best practices to handle pod reliability issues. With declarative pipelines, every configuration in this infrastructure is becoming YAML or JSON. We also found that a few tests need 200 pods simultaneously; those use cases are solved by increasing the replica count or overprovisioning pods in the infra namespace upfront. The solution was designed with future technologies in mind: today we use Gatling, tomorrow something else. As long as the test code is wrapped in a container and aggregation is built around it, Distro can support it. Most commercial tools in the distributed load testing space stick to their proprietary language and won't support other technologies.
Metrics
Test Scale
We use this infrastructure to run three types of tests: performance tests, scale tests, and longevity tests. First, the performance test is a standard test where we create a 10K TPS (transactions per second) load. It executes against a specific service, namespace, or workload by hitting its endpoint; we call this a vertical test, since it impacts only one namespace. Second, in the horizontal test we execute a similar vertical load across all the namespaces in one or many clusters; we have used up to 200 namespaces. Third, the longevity test ensures we can handle the scale under constant load for a longer time; we run this test for 72 hours.
Distributed Load
Vertical: this test can finish with 6 pods and create more than 10K TPS, executing over 50 million transactions. The diagram below shows Prometheus metrics for the targeted service endpoint while the application scales.
Horizontal: this test takes anywhere from 10 pods to hundreds of pods based on the load requirements. Each pod drives 2 to 5 TPS per endpoint, and we have hit hundreds of millions of transactions in these executions. It typically creates up to 50K TPS on the overall cluster, across hundreds of namespaces, where each namespace/service endpoint takes a 5 TPS load.
Longevity: hundreds of pods distributing load across hundreds of services for multiple days (3) gives a truly distributed load. We have crossed billions of transactions with these tests, which shows the real scale of the Distro solution.
We have replicated this pattern across tens of clusters, serving the services running in the namespaces of those clusters. We used to run this centrally with dedicated support; with Distro, we now run it distributed and self-service.
Benefits
This solution has been used for more than a year for all kinds of distributed load testing on our Kubernetes platform, with more than 20 teams actively operating it in a self-service manner. Wearing both the customer and engineer hats helped us resolve many product challenges we faced while building this solution, whereas commercial products take their own time to address user requirements. Finally, it is open source and built on Argo and Kubernetes. It currently supports a few technologies such as Gatling and Node, and can handle any test container as long as it follows the principles above. We currently support AWS, but the same can be added for Azure, GKE, or any other cloud provider. We started with one cluster; today more than 30 clusters use it to distribute load.
Team and Support
Thanks to the Argo workflow team; Jesse, Alex C, and Bala helped with this integration and resolved many issues. Thanks also to my team, including Phani and Vijay, and last but not least, Navin, who contributed immensely to this project.
References
- Medium post on distributed load testing using Kubernetes
- Google Cloud distributed load testing solution
- About test containers
- Get started with the Distro project
- Get started with building a test container for load testing
- Argo change as a template