Operating Kubernetes at Scale

Sumit Nagal
11 min read · Jan 1, 2021


Introduction

Intuit is building a next-gen Kubernetes platform to host Intuit services and products. We have already shared how TurboTax scaled over the last two tax seasons (Blog-1 and Blog-2) and how we keep learning and improving at scale. At KubeCon 2020, we also shared how we manage 2.5M lines of code via GitOps. This effort from the platform, cloud, application, and various other teams helped the Intuit Kubernetes platform reach "Operating at Scale." We currently support more than 200 clusters of various sizes, with both on-prem and AWS EKS control planes. More than 2,500 services are onboarded as part of the paved road, and thousands of engineers across the company consume this platform. Last but not least, this platform supports product lines ranging from microservices to monoliths at scale, for products like TurboTax, QuickBooks, FDP, and Mint.

Pillars

When we started building this platform, the first challenge was to pick a reference microservice (described in the Application section below) and build a test harness around it. The second challenge was simulating this microservice to reflect the business/domain problems of TurboTax and QuickBooks. The last was whether we could scale this microservice to represent a realistic load for our product lines. We realized very early in this journey that we had to build reliability upfront, before we onboarded our product lines. We started gathering requirements and putting effort into building the five pillars that form the platform's backbone. As part of this effort, we optimized our application for performance and scale, which we have shared as part of our microservice optimization work. Two of the solutions (Chaos and Distro) were initially used for platform certification, were then adopted by our internal customers, and were later open-sourced.

1. Vertical Scale

What

We want to scale vertically, which means optimizing the microservice for performance. We want our service to perform optimally on a given resource type, understand how large a resource type we can use, and determine how many pods we can best pack onto that resource type. When we started with our reference app, we could not scale beyond 10 TPS. We also need monitoring and reporting to measure and improve. Our target throughput is in the thousands of TPS.

How

Most importantly, we had to see how much more throughput our microservice could deliver; starting from 10 TPS, we had to make some fundamental changes to GC (garbage collection) usage. Once optimized for GC, the service started using caching with an in-memory DB, resulting in better performance numbers and higher throughput. We optimized our test harness to use CRUD operations based on customer interaction, and later added mocking to handle user-access dependencies. After those changes, we moved on to scaling the resource type from large to 4xlarge. We first optimized the single microservice as a pod; then we needed to see how many pods we could run, with requests/limits, on a given resource type. After these changes, our microservice delivered under 50 ms response time at 10K TPS. We use HPA to scale these resources for a given service. This became our standard test for the platform.
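
As a rough illustration, the pod-packing math above is expressed through container requests and limits; the names and numbers below are hypothetical, not our actual settings.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reference-app                       # hypothetical service name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: reference-app
  template:
    metadata:
      labels:
        app: reference-app
    spec:
      containers:
        - name: reference-app
          image: example.com/reference-app:latest   # hypothetical image
          resources:
            requests:                       # what the scheduler uses to pack pods onto a node
              cpu: "2"
              memory: 4Gi
            limits:                         # hard ceiling per pod
              cpu: "4"
              memory: 6Gi
```

With requests like these, an m5.4xlarge-class node (16 vCPU, 64 GiB) fits roughly seven such pods after accounting for system daemons, which is the kind of packing calculation we ran for each resource type.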

Benefits

We moved from 10 TPS to 10K TPS at the top end, and this standard test started catching issues whenever something changed on the platform. It covered most of the single-service scale use cases, and this load gives us confidence that we can scale a single service in a cluster. We identified many platform limitations, which we worked on and resolved. During this time we handled the SIGTERM fixes and burst-mode scaling for a single service. Later this work was ported to EKS; with the benchmark in hand, we can size EKS capacity for the platform's needs. Today, we can scale a single service across more than 500 nodes at 30K TPS on the cluster.

Learnings

HPA Burst mode

  • We have to scale beyond what HPA gives us out of the box, as our scale requirements are very high. We want to scale past 100 replicas within 10 minutes, which conventional HPA settings won't provide. So we added logic to increase scaling aggressiveness as load increases; a sketch of a similar approach follows.
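
Our burst mode was custom logic, but on newer Kubernetes versions a similar effect can be approximated with the HPA `behavior` field; the sketch below is an illustration under that assumption, not our implementation.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: reference-app                 # hypothetical service name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: reference-app
  minReplicas: 10
  maxReplicas: 300
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # react immediately to a load spike
      selectPolicy: Max
      policies:
        - type: Percent               # allow doubling the replica count every minute
          value: 100
          periodSeconds: 60
        - type: Pods                  # or adding 30 pods, whichever is larger
          value: 30
          periodSeconds: 60
```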

Node scale

  • Getting a new node and attaching it to a service takes 5–7 minutes. Overprovisioning nodes helps with scaling; a sketch of one common approach follows.
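
One common way to overprovision is to run low-priority "placeholder" pods that keep spare nodes warm and that real workloads evict instantly; the names and sizes below are hypothetical.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                            # lower than any real workload, so placeholders are evicted first
globalDefault: false
description: "Placeholder pods that keep spare nodes warm"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: kube-system
spec:
  replicas: 5                         # roughly how many nodes' worth of headroom to keep
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:                 # sized to reserve most of a node
              cpu: "3"
              memory: 8Gi
```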

Cluster Autoscaler

  • The cluster autoscaler has limitations when pods use anti-affinity; increasing the autoscaler's CPU limit helps. You should benchmark the autoscaler against your cluster size.
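
For context, this is the kind of spreading rule that makes the autoscaler's scheduling simulation expensive at large node counts; the fragment below goes under a Deployment's pod template spec and uses a hypothetical app label. Preferring, rather than requiring, anti-affinity keeps scale-up decisions cheaper.

```yaml
# Fragment of spec.template.spec — spread replicas across nodes
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: reference-app            # hypothetical label
          topologyKey: kubernetes.io/hostname
```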

2. Horizontal Scale

What

After achieving vertical scale, we found that customers do not use the cluster the way we certified it: they run many services in a cluster, each as its own workload/namespace, and each service has different scale requirements. So we needed a way to run various microservices in a cluster, with each scaling vertically and the cluster scaling horizontally. This was a challenging problem, as we needed to create everything dynamically — not only the services but also the test harnesses that test them. The infrastructure used to run those tests also needed to be dynamic. So we had to solve for multiple services, the tests against those services, and the infrastructure used to create the scale. Last, we needed to be sensitive to cost, since we want to bring resources back down.

How

As we started exploring options, we found that we could not create namespaces in parallel due to the nature of our setup. For security and compliance, the creation of any microservice namespace goes through SSO. We needed a fixed infrastructure that could change dynamically. That is when we leveraged Argo CD and kustomize and built a dynamic infrastructure that can scale up and down based on GitOps. We created a command-line utility that picks up dynamic values and creates the resources, synced via Argo CD and driven by a Jenkins pipeline. We started with 50 microservices on a cluster, scaling them by fixed and random amounts and making our Distro infrastructure dynamic. We modeled our clusters on the most common use cases, which use 20 to 200 namespaces per cluster. Now we can create services that scale vertically and later horizontally, while the cluster itself also scales horizontally.
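
The dynamic infrastructure is plain GitOps: each generated service ends up as an Argo CD Application pointing at a kustomize overlay whose values our utility fills in. The sketch below is illustrative only; the repo URL, paths, and names are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: perf-svc-017                        # generated per-service name (hypothetical)
  namespace: argocd
spec:
  project: performance
  source:
    repoURL: https://github.example.com/perf/distro-infra.git   # hypothetical repo
    targetRevision: main
    path: overlays/perf-svc-017             # kustomize overlay rendered by the CLI utility
  destination:
    server: https://kubernetes.default.svc
    namespace: perf-svc-017
  syncPolicy:
    automated:
      prune: true                           # scale down by deleting the overlay from Git
      selfHeal: true
```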

Benefits

We hit cluster capacity and Prometheus scale problems during this time; we also found autoscaler limitations when scaling across clusters and updating its configuration. We can now scale 200 namespaces with 900+ nodes at 50K TPS on a cluster; this became our main exit criterion for major releases. We worked tirelessly to find all the bottlenecks and ensure we have a stable platform.

Learnings

AWS scale

  • AWS AZ scaling and rebalancing create churn; avoid it if you don't have resiliency built into your service.

Synchronizing

  • Scaling multiple services in a cluster on a single AWS account is challenging. We had to apply the overprovisioning and burst-mode approaches mentioned above.

Prometheus

  • Prometheus hit a scale limitation at the cluster level; increasing the machine type only helps vertically. We later found that Thanos is the solution for scaling it horizontally; a sketch follows.
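
If you run Prometheus through the prometheus-operator, attaching a Thanos sidecar is mostly declarative; the sketch below assumes the operator's Prometheus CRD and a pre-created object-storage secret (names are hypothetical).

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: cluster-metrics                 # hypothetical name
  namespace: monitoring
spec:
  replicas: 2
  retention: 6h                         # keep local retention short; long-term data lives in object storage
  thanos:
    image: quay.io/thanos/thanos:v0.32.5
    objectStorageConfig:                # secret containing the Thanos object-store config
      name: thanos-objstore             # hypothetical secret name
      key: objstore.yml
```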

Readiness Gate

  • There are many positives to implementing readiness gates, but the cost is a delay in application startup time and in the pod joining the service.
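
For reference, a readiness gate is just an extra pod condition that an external controller (for example, the AWS load balancer controller reporting target health) must set to True before the pod counts as ready; the condition type below is hypothetical.

```yaml
# Fragment of spec.template.spec — pod is Ready only after the external condition is True
readinessGates:
  - conditionType: "target-health.elbv2.k8s.aws/reference-app"   # hypothetical condition type
```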

3. Reliability

What

Reliability is an extremely important factor for this platform; we have had many production incidents in the past tied to application resiliency. Building platform resiliency was an explicit ask of this team. We had been working on a chaos solution for a while, but the more important thing was the use cases. So we defined use cases for the application, the cloud, and the platform, and set expectations for those use cases while building the chaos tests. We had to pick a solution that was cloud native and supported our use cases. We invested heavily in chaostoolkit, and we want to continue using it because it is open and we can customize it for Intuit's needs. Another important dimension is automation and the ease of adding new use cases.

How

Thanks to Argo, LitmusChaos, and chaostoolkit, we have a solution that can validate application resiliency by injecting various kinds of chaos into the application. We had to build many AWS chaos use cases, such as EC2 and ELB termination, to reflect real customer scenarios. Most importantly, we had to validate Kubernetes platform resiliency for its various components. We leveraged Argo Workflows with Jenkins to execute the experiments in an automated manner. We later open-sourced this work and made it available to all our 100+ clusters as a Kubernetes add-on.
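
To give a feel for what one of these experiments looks like, below is a minimal LitmusChaos ChaosEngine that deletes pods of a target application; the labels and service account are hypothetical, and our actual suite mixes chaostoolkit and Litmus experiments.

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: reference-app-chaos              # hypothetical name
  namespace: reference-app
spec:
  appinfo:
    appns: reference-app
    applabel: app=reference-app          # hypothetical selector
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin      # hypothetical service account
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION # run the experiment for 60 seconds
              value: "60"
            - name: CHAOS_INTERVAL       # delete a pod every 10 seconds
              value: "10"
            - name: FORCE
              value: "false"
```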

Benefits

Our platform became more resilient; we identified many issues and fixed them one by one. Some of the findings are valuable simply because we now know the limitations of the cloud and the platform. We have created a robust chaos suite that validates every release and ensures no regression has been introduced. Later this also became part of the exit criteria for any major release. We are now comfortable adding more functionality to the platform, along with the relevant use cases for its resiliency. Application resiliency has seen many improvements, especially since we moved to EKS, and our applications have become more resilient to chaos. More information is shared in this blog.

Learnings

SIGTERM

  • Handling SIGTERM gracefully is important for surviving application interruption. Please build this resiliency into your application.

Node

  • Node cleanup and draining have to align with application resiliency and health checks to avoid failures. Please ensure your application's shutdown hook matches the ELB health-check and deregistration timing; a sketch follows.
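
A typical way to line these up in the pod spec is a preStop sleep that keeps the pod serving while the load balancer deregisters it, plus a termination grace period long enough for in-flight requests; the numbers and names below are hypothetical and must be tuned against your ELB health-check and deregistration settings.

```yaml
# Fragment of spec.template.spec
terminationGracePeriodSeconds: 60             # must cover the preStop sleep plus in-flight request drain
containers:
  - name: reference-app                       # hypothetical container name
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 20"]   # keep serving while the ELB deregisters the target
    readinessProbe:
      httpGet:
        path: /health/ready                   # hypothetical endpoint
        port: 8080
      periodSeconds: 5
      failureThreshold: 2
```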

Error 502/504

  • Application 502s/504s are a reality; try to minimize them as much as possible. The two errors should be handled differently in the application, and they can come with a new node joining, a node draining, AZ rebalancing, or pod interruption.

4. Upgrade

What

Intuit has security compliance requirements, and we can't keep an AMI (Amazon Machine Image) for long, so we rotate AMIs based on security guidance. This is a major feature of our platform offering: no one needs to make any change, and microservices run on the latest and greatest compliant image. As we added more rigor to the testing above, we found that our customers were experiencing upgrade issues. The main problem is that clusters become big, and an upgrade then takes a long time. During the upgrade, new nodes coming in cause 5XX issues if resiliency is not handled correctly. Also, AWS sometimes rebalances AZs, causing delays in getting nodes in a specific region due to capacity problems. We need to identify and handle these issues before they reach production, so that our customers are better prepared for real traffic.

How

We have two types of upgrades: one without image rotation and one with image rotation. An image-rotation upgrade usually takes a long time, as Kubernetes drains each node and evicts its pods. We aim for zero-downtime upgrades, so applications need the resiliency to finish processing existing requests while no longer accepting new ones. We built this use case on the horizontal-scale cluster described above, with a 1 TPS load to simulate a real customer scenario, and then triggered an upgrade. Later, we extended this work to the biggest cluster and articulated the expected upgrade duration and impact. We also identified the Kubernetes components that need to be scaled up to handle this when they do not scale horizontally.
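
One building block for zero-downtime node drains is a PodDisruptionBudget, which caps how many replicas an eviction can take out at once; the sketch below uses hypothetical names and thresholds.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: reference-app                   # hypothetical name
  namespace: reference-app
spec:
  minAvailable: 80%                     # drains proceed only while 80% of replicas stay available
  selector:
    matchLabels:
      app: reference-app
```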

Benefits

This suite became a needle mover, as we now have answers for every cluster upgrade; we made scale and configuration changes wherever needed to achieve a zero-downtime upgrade. We can model customer clusters and share expectations for cluster upgrades with them. We also decided that every release should go through our dog-food clusters before hitting pre-prod and production clusters. We found and fixed many platform-specific errors that could have impacted customers if we hadn't caught them. And by keeping a small test load running, we can verify whether any other system component impacts the service-level SLA.

5. Stability (Platform Health Metrics)

What

When we targeted a specific scale, we often found that the real issue lay somewhere else. Most of our testing targets a specific SLA; when it is breached, we dig deeper to identify the problem. This brought up another use case: we had to build platform health metrics that tell us, no matter what kind of activity is happening on a cluster, whether the platform can be called healthy. We did have such checks in place at the time, but running them took an hour.

How

We worked with platform engineers and identified all the Kubernetes components that are essential for running the clusters. We defined the criteria for calling them healthy by adding checks against their health endpoints; in some cases we also defined the minimum number of instances required for a component to be considered healthy. Slowly we added more than 30 use cases which, when executed, tell us whether the cluster is in a healthy condition. We added this to our reliability suite and execute it before and after the big suites above. The report compares the pre- and post-execution state and has revealed many opportunities. This later became standard practice for the four pillars above; an illustrative check definition follows.
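
To make this concrete, here is a hypothetical, simplified version of what such declarative health-check definitions could look like; this schema is illustrative only and is not our actual tooling.

```yaml
# Hypothetical health-check definitions (illustrative schema, not our real tooling)
checks:
  - name: coredns
    namespace: kube-system
    kind: Deployment
    minReadyReplicas: 2                 # fewer ready replicas than this marks the cluster unhealthy
    endpoint: http://coredns.kube-system:8080/health
  - name: cluster-autoscaler
    namespace: kube-system
    kind: Deployment
    minReadyReplicas: 1
    endpoint: http://cluster-autoscaler.kube-system:8085/health-check
```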

Benefits

Thanks to this platform-stability work, we learned a lot about every component and its behavior, including the rare cases where memory/CPU exhaustion causes crashes. Now we have complete visibility into our platform. We also know the impact of each pillar and how it affects overall platform stability. We have raised more than 50 issues during this process, bringing awareness and making the platform robust. These efforts have resulted in many Intuit products and services seamlessly onboarding onto this platform with confidence.

Quality Gating

To measure the above pillars, we had to define strong quality criteria. We fetch these metrics through our observability platform. Later we experimented with Keptn-based SLI/SLO to handle them dynamically; a sketch follows the list below. As most of our data is in Wavefront, we are working on a Wavefront extension to make quality gating well integrated.

1. Status Code

  • Not more than 0.01% errors / 5XX (500/502/504)

2. Response Time (Percentiles)

  • Response time for P90: below 30 ms
  • Response time for P95: below 50 ms
  • Response time for P99: below 100 ms

3. Platform Health Metrics

  • 100% success
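
Here is a rough sketch of how these gates might be expressed as a Keptn slo.yaml; the SLI names are hypothetical and would have to match SLI definitions wired to our observability (Wavefront) provider.

```yaml
# Hypothetical Keptn slo.yaml for the gates above
spec_version: "1.0"
comparison:
  aggregate_function: avg
  compare_with: single_result
  include_result_with_score: pass
objectives:
  - sli: error_rate_5xx                 # hypothetical SLI name
    pass:
      - criteria:
          - "<=0.01"                    # not more than 0.01% 5XX
  - sli: response_time_p90
    pass:
      - criteria:
          - "<=30"                      # milliseconds, per the SLI definition
  - sli: response_time_p95
    pass:
      - criteria:
          - "<=50"
  - sli: response_time_p99
    pass:
      - criteria:
          - "<=100"
  - sli: platform_health_success
    pass:
      - criteria:
          - ">=100"                     # platform health metrics must be 100% successful
total_score:
  pass: "100%"
  warning: "90%"
```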

Test Execution

  • We first built the use cases to test this microservice; it is close to a real application and is based on a GraphQL schema. Since the service supports CRUD operations, we built Create, Fetch, Delete, and Modify API use cases. In the test harness, Create is 20% of calls and Fetch is 70%, with Delete and Modify at 5% each, based on production-like usage of this API. This is wrapped in end-to-end execution via the Gatling framework. We used Distro for distributed load generation to execute the tests. As you have observed, the scale requirements are very dynamic and the services scale horizontally/vertically, so the test infrastructure needs to follow the same pattern. Distro is a solution that can scale the test infrastructure on Kubernetes. It packages the test code as a container image, just as the service uses an application container image. This work is orchestrated via Argo Workflows and executed via a Jenkinsfile; a sketch follows. Please check out this blog for more information.
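
To give a feel for the orchestration, here is a heavily simplified Argo Workflow that fans out N copies of a containerized Gatling test; the image, simulation class, and parameter names are hypothetical, and Distro adds much more on top of this (result aggregation, dynamic scaling of the generators, and so on).

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: crud-load-test-
spec:
  entrypoint: load-test
  arguments:
    parameters:
      - name: generators
        value: "10"                     # number of parallel load-generator pods
  templates:
    - name: load-test
      steps:
        - - name: run-gatling
            template: gatling-runner
            withSequence:
              count: "{{workflow.parameters.generators}}"
    - name: gatling-runner
      container:
        image: example.com/reference-app-tests:latest       # hypothetical test image
        command: ["gatling.sh"]
        args: ["-s", "simulations.CrudMixSimulation"]        # 20% create / 70% fetch / 5% modify / 5% delete
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
```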

Application

  • We have used a Spring Boot application exposing GraphQL/REST. The application uses a domain schema and provides CRUD operations on that domain: adding a resource and then manipulating it within the business domain. As an example, we add a document and manipulate that document stand-alone. We added an in-memory DB to this application and leveraged caching to fetch the resource data.
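
A hypothetical Spring Boot configuration along these lines (in-memory H2 database plus a cache for reads) might look like the following; this is an assumption for illustration, not our application's actual configuration.

```yaml
# Hypothetical application.yaml for the reference app
spring:
  datasource:
    url: jdbc:h2:mem:domaindb;DB_CLOSE_DELAY=-1   # in-memory database
    driver-class-name: org.h2.Driver
  jpa:
    hibernate:
      ddl-auto: update
  cache:
    type: caffeine                                # assumes the caffeine dependency is on the classpath
    cache-names: documents
    caffeine:
      spec: maximumSize=10000,expireAfterWrite=60s
```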

Team and Support

  • Thanks to Shri, Sheldon, Gagan, Mark, and Venkat, and to my team members Vijay, Navin, and Anu. Special thanks to Ed and Lauren for reviewing the blog and providing valuable feedback.
