This year, 2020, we have seen some real chaos. As we come out of this pandemic, one question strikes me: have we really thought about chaos? If so, how do we build resiliency into the day-to-day things we deal with today? This line of thinking leads to a focus on being ready for chaos in software, products, and platforms. Chaos in the human environment is beyond imagination, but we can act on the things we control, and we can put extra effort into making them reliable. How can we achieve this? Chaos Engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions. It takes place in a controlled environment, where you inject chaos and verify that the system’s state remains stable. I am sharing what I have learned on this journey, from application to cloud to platform (Kubernetes with Keiko). As I mentioned, it is a journey, so there is a lot more to learn and achieve.
Starting on Chaos
I started on chaos engineering a couple of years back, when we ran Netflix’s Chaos Monkey and Simian Army on AWS. The main focus was application resiliency and implementing the Hystrix pattern. Later, we explored more tools to identify where the Hystrix pattern could be applied in an application (mentioned below). We then reproduced those chaos scenarios for our applications and kept ourselves engaged in chaos work. It is essential for chaos engineering that resiliency is built into the application. These tools were stand-alone solutions, and bringing them under one solution was tough. We did many POCs and tried various open-source and commercial chaos engineering tools. We evaluated them against our chaos use cases as well as our security and compliance requirements. We found that customizing a commercial solution to meet both our needs and our security compliance was not viable. We were looking for a robust solution whose core functionality we could leverage, contribute our requirements to, and customize for our needs.
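For readers unfamiliar with the Hystrix pattern mentioned above, its core idea is a circuit breaker: after repeated failures, stop calling the struggling dependency for a while and serve a fallback instead. The following is a minimal sketch of that idea in Python. It is not the Hystrix library and not our implementation; the class name, thresholds, and injectable `clock` parameter are assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker: fail fast after repeated errors,
    then allow a trial call once a cool-down period has passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so the breaker is testable
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        # While the circuit is open, skip the dependency and use the fallback.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: let one trial call through

        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()

        self.failures = 0  # a success closes the circuit fully
        return result
```

A chaos experiment against a service wrapped this way would check that callers keep receiving the fallback, instead of hanging or erroring, while the dependency is down.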
Chaos Game Day
We hosted a chaos engineering meetup. The meetup brought chaostoolkit to our attention: an open-source, highly customizable tool that follows the Principles of Chaos Engineering. We did an initial POC and found it useful and open enough to customize for our needs. We worked on a small scale with various teams to adopt this solution, and during this phase we ran a Game Day on Kafka reliability (see image below). It was still just the idea phase.
As we started working with this framework, we began using many of its plugins. We realized that we needed to customize some plugins and write our own extensions where none existed. We made incremental progress on our work, framework changes, and new plugin support. We started with 7–8 existing plugins, added 4–5 new ones, and enhanced the Kubernetes and AWS plugins for our needs. We are exploring options for contributing our AWS and Kubernetes plugin changes back. At that point we needed to understand customer requirements, as the work had gone beyond a POC.
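To give a feel for what a chaostoolkit experiment looks like, here is a small sketch that builds one as a Python dictionary and prints it as JSON. It follows the experiment layout chaostoolkit documents (a steady-state hypothesis, a method, and rollbacks). The service URL, label selector, and namespace are hypothetical, and the `terminate_pods` action is assumed to come from the chaostoolkit-kubernetes extension; check the argument names against the version you have installed. This is not one of our Game Day experiments.

```python
import json


def build_experiment(service_url, label_selector, namespace):
    """Return a Chaos Toolkit style experiment as a plain dict."""
    return {
        "version": "1.0.0",
        "title": "Service stays available when one pod is terminated",
        "description": "Illustrative example with hypothetical targets.",
        # What 'healthy' means: checked before and after the method runs.
        "steady-state-hypothesis": {
            "title": "Service responds with HTTP 200",
            "probes": [
                {
                    "type": "probe",
                    "name": "service-must-respond",
                    "tolerance": 200,
                    "provider": {"type": "http", "url": service_url, "timeout": 3},
                }
            ],
        },
        # The turbulence we inject.
        "method": [
            {
                "type": "action",
                "name": "terminate-one-pod",
                "provider": {
                    "type": "python",
                    "module": "chaosk8s.pod.actions",  # assumed extension module
                    "func": "terminate_pods",
                    "arguments": {
                        "label_selector": label_selector,
                        "ns": namespace,
                        "rand": True,
                    },
                },
                "pauses": {"after": 10},  # give the platform time to recover
            }
        ],
        "rollbacks": [],
    }


# Hypothetical values, for illustration only.
experiment = build_experiment(
    service_url="http://my-service.example.internal/health",
    label_selector="app=my-service",
    namespace="my-namespace",
)
print(json.dumps(experiment, indent=2))
```

Saved to a file such as `experiment.json`, a definition like this is what the `chaos run` command consumes.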
Chaos Tipping Point
The first turning point was Innovation Days 2018, where we presented the chaos engineering mindset and won second place. The second turning point came in 2019. Intuit was building its next-generation Kubernetes platform, and as part of that we had onboarded many teams onto Kubernetes clusters. Our customers faced a few resiliency-related incidents during that time, which pointed to the need for further investment in reliability, especially on our platform. Because this is a core platform, its reliability matters most. We spent close to two weeks in a war room, built all the use cases we could, and validated our platform (Kubernetes with Keiko). We partnered with a few teams during this time and shared our idea for consuming chaos solutions. Chaos was becoming essential; we were moving from a pilot project to a mainstream one.
Chaos as a Service
During the war room, as we validated chaos use cases for Kubernetes, two questions remained open: how could we automate these use cases, and how would we execute them as part of platform certification? Since we were using Kubernetes, we did a POC and built a Python-based deployment service that runs the chaos framework in a container alongside the application. This chaos service interacts over a REST API and uses a cluster role to execute chaos operations. Later we improved it and created a Node.js-based UI so that teams could adopt it quickly. We wanted to eat our own dog food, so we planned to roll it out with a few teams and validate chaos adoption. Some customers opted in, liked the solution in their pre-prod environments, and are using it to this day. We were building a product for broader adoption.
Chaos on Kubernetes/Cloud
Chaos as a Service was working well when our PM brought in another dimension: rolling the solution out across clusters, since teams wanted reliability for their applications and services. When we reviewed the existing solution with the security team, we found compliance and rollout challenges. Setting up access and permissions in service mode raised feasibility, security, and compliance issues. On top of that came the question: this is not a Kubernetes-native solution, so how would we roll it out across hundreds of clusters? So we started searching for a solution in the Kubernetes space. Initially, we built two solutions as controllers (AWS and Kubernetes), but then the question arose of how we would port our framework and use cases; the porting effort would be substantial. Hence we explored open source again and did many POCs before choosing Litmus. We pivoted and reconsidered the solution for our current needs.
Chaos Architecture for Kubernetes
Before picking Litmus, we did a technical as well as a use-case feasibility analysis. We chose Litmus because it has a plugin architecture, is community-backed, and offers Kubernetes-native support. More importantly, it is open source and has since become a CNCF sandbox project. Our first POC looked at porting our current work, which covers more than 50 use cases. We wanted to minimize effort, so we built this project to interact with both Kubernetes and chaostoolkit. We looked at Litmus’s plugin infrastructure and its CRD-based approach, which creates custom resources; within a couple of weeks, we were ready to execute our first Kubernetes use case on Litmus. We presented to the Litmus community and have been sharing our progress every month. I am a maintainer for SIG-Integration under Litmus. We are incorporating more use cases and building on the core solution for cluster and service owners.
Chaos using CRDs / CRs
When we started looking at Litmus, we wanted our chaos use cases to execute in a Kubernetes-native way. We had to build a container-native solution that runs a container with our code, leveraging CRDs and custom resources. Integrating the framework required us to break up our service layer and orchestrate it through a Python wrapper. We built resources that match and invoke our framework, using the Litmus runner and operator to execute this container. During the integration we hit many challenges, which were resolved with the help of the Litmus community. Custom annotations, overriding custom resource permissions, and passing IAM roles for AWS are a few of the contributions we made at the Litmus level to support our custom use cases. With these changes, we can now execute Kubernetes experiments as well as AWS experiments. Please refer to the pod custom resources for the engine and experiment.
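As a rough illustration of the custom-resource approach, the sketch below assembles a Litmus ChaosEngine manifest as a Python dictionary. The field names follow the `litmuschaos.io/v1alpha1` ChaosEngine schema as I understand it. The resource name, namespace, label, service account, experiment name, and environment values are all placeholders, not the resources from our project, so treat this as a starting point and compare it with the Litmus documentation for your version.

```python
import json


def build_chaos_engine(name, namespace, app_label, service_account,
                       experiment, env=None):
    """Assemble a ChaosEngine custom resource as a plain dict."""
    # Litmus passes tunables to the experiment container as env vars.
    env_list = [{"name": key, "value": str(value)}
                for key, value in (env or {}).items()]
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Which application the chaos targets.
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            # Only act on applications that opted in via annotation.
            "annotationCheck": "true",
            "engineState": "active",
            "chaosServiceAccount": service_account,
            "experiments": [
                {"name": experiment,
                 "spec": {"components": {"env": env_list}}}
            ],
        },
    }


# Placeholder values, for illustration only.
engine = build_chaos_engine(
    name="my-service-chaos",
    namespace="my-namespace",
    app_label="app=my-service",
    service_account="chaos-sa",
    experiment="pod-delete",
    env={"TOTAL_CHAOS_DURATION": 30, "CHAOS_INTERVAL": 10},
)
print(json.dumps(engine, indent=2))
```

JSON is valid YAML, so the printed manifest can be applied with `kubectl apply -f` once the matching ChaosExperiment and service account exist in the namespace.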
Chaos use cases
We divided our customer personas into cluster admin and service owner. A cluster admin has broader permissions for cluster-level chaos, whereas a service owner has limited permissions, scoped to their own namespace, to interact with their application. These personas run chaos on the platform, application, and cloud using our orchestration code, the chaos toolkit framework. We execute Kubernetes, application, and cloud-based chaos while tracking platform health and service uptime. We have three buckets of use cases, catering to application, platform (Kubernetes with Keiko), and cloud (AWS). Within these categories, we divided test cases into Tiers 1, 2, and 3 based on their feasibility and execution frequency. Please refer to the git and doc for more information. So far we have contributed Kubernetes pod, microservice, and AWS EC2 chaos use cases.
We use Argo Workflows for test infrastructure execution. Chaos on Argo Workflows is a solution we built in-house and shared with the Argo community. It helps us automate and gives us a way to maintain infrastructure as code. We extended similar work, added it as part of the chaos workflow, and contributed it back to open source. Later we shared how we use a declarative pipeline and execute it via a Jenkins pipeline. Please refer to the git, doc, and video for more information. We are taking the solution to the next level with our expertise in the Argo space.
Chaos and Performance Engineering
We take failed customer interactions very seriously, which means we don’t want our services to go down. When we started doing chaos experiments, most of the chaos run through our solution was stateful: we injected the failure and then went and validated the uptime and availability of the service. There was no stateless way to keep checking service uptime while a pod or application was down; although we use HPA, we wanted to validate uptime during pod downtime. Hence we introduced performance tests alongside chaos. We enhanced our existing performance infrastructure, which is built on Argo Workflows, and included it in the chaos workflow. Now we can run chaos and performance tests together, leveraging Argo Workflows on the performance infra. This solution is built on open source and has been shared with the Argo Workflows community. We now measure resiliency in terms of end-customer experience, or service uptime. The chaos workflow with performance tests is part of our end-to-end automation via Argo Workflows. Please refer to the git and video for more information.
If you can’t measure, you can’t see progress or improve. Every team and company has its own way of reporting, and the same is true for us. Chaostoolkit gives us a report, and Litmus provides chaos results and Kubernetes events; they are useful, but we have specific reporting requirements. Fortunately, we had built our reporting requirements into the framework from day one, so with some small changes we could push all the data to our Kafka-based data lake. This reporting lets us match and derive results alongside other operations such as code and performance tests. Today we move all our chaos results to a centralized data lake and use Kibana to build custom dashboards that provide information on chaos execution.
The latest feature in this journey is Chaos GitOps; there is plenty of information available on GitOps itself. We already follow GitOps for our application deployments using Argo CD, and now we have started doing GitOps for chaos too. We are working on a solution leveraging Litmus and Argo CD in which chaos is available on the cluster. GitOps is applied at the namespace level for experiments and roles. For namespace GitOps we leverage kustomize, synced via Argo CD. For application GitOps we annotate the application, using the custom annotation support in the Litmus chaos operator. Please refer to the video for more information; code will be coming soon, so stay tuned.
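The idea of checking uptime while chaos is in progress, instead of only afterwards, can be sketched in a few lines of Python. This is a simplified stand-in, not our Argo-based performance setup: the function name `measure_uptime`, the `probe` and `chaos` callables, and the polling interval are all assumptions made for the example.

```python
import threading
import time


def measure_uptime(probe, chaos, interval=0.5):
    """Poll `probe` in the background while `chaos` runs.

    `probe` is any callable returning a truthy value when the service is
    healthy; `chaos` is any callable that injects the failure and returns
    when it is finished. Returns (successful_probes, total_probes).
    """
    done = threading.Event()
    results = []

    def poll():
        while True:
            try:
                results.append(bool(probe()))
            except Exception:
                results.append(False)  # an exception counts as downtime
            if done.wait(interval):    # True once chaos has finished
                break

    poller = threading.Thread(target=poll, daemon=True)
    poller.start()
    try:
        chaos()
    finally:
        done.set()
        poller.join()
    return sum(results), len(results)


if __name__ == "__main__":
    # Stand-ins: a probe that always succeeds and 'chaos' that just sleeps.
    ok, total = measure_uptime(lambda: True, lambda: time.sleep(0.2), interval=0.05)
    print(f"availability during chaos: {100.0 * ok / total:.1f}%")
```

In a real run, the probe would be an HTTP health check or a load-test request, and the ratio of successful probes gives an uptime figure for the chaos window that can be compared against the service’s availability target.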
Chaos Team and Support
Here are the engineers (Vijay, Navin, Anu, Phani, Gunjan, May, Ravi, Veena) working with me on this project. Special mention to Russ and Sylvain from chaostoolkit and to Karthik and Uma from Litmus, who guided me on this journey.
Sumit Nagal, Principal Engineer, Intuit Inc.