Insight on JVM Tuning

Sumit Nagal
11 min read · Aug 23, 2020


Introduction

The amount of information available on tuning JVMs is overwhelming! There are many techniques and even more blogs published on the subject. Over the last few years, I have been working on monoliths, microservices, and, most recently, containers on Kubernetes. I found many opportunities for JVM tuning and, through much experimentation and analysis, learned how to achieve optimal results. Unfortunately, there is no fixed formula for JVM tuning that works every time in every environment. However, there are some best practices that I have found quite useful for getting close to optimal results. Because there is no fixed formula, I usually call JVM tuning an “Art.”

In this blog, I give a high-level picture of the spectrum of techniques for the full life cycle of JVM tuning. We will cover upcoming JVM technology changes that bring new features and new GC algorithms. As memory and CPU become cheaper, different trade-offs become possible, so JVM tuning is a moving target. Below is the ideal state of an application: regular GCs happen and objects are garbage collected at a steady rate. If your application’s garbage collection runs regularly, clearing objects from the heap and bringing heap memory back to its base state, that is ideal.

Heap usage graph generated by https://gceasy.io

Every application should try to reach this base state. However, for most modern applications, there is still an opportunity to tune.

Customer

We need to understand customer needs and the SLAs for our products. We follow the criteria below, though they may vary based on your customer requirements.

  • Stability — We always focus on customer FCI (Failed Customer Interaction), ensuring the product behaves correctly for the customer and that they do not see errors on their side.
  • Performance — Once stability is achieved, the focus shifts to serving the customer in the optimal way, which we measure as TP99 (the 99th percentile of response time). This is how we measure customer-perceived performance (a small sketch of the calculation follows this list).
  • Availability — Last but not least, ensure that product availability stays beyond 99.99% and that there are no outages. Scale is the next step in this journey, where you can scale vertically and horizontally based on product needs.
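
To make TP99 concrete, here is a minimal sketch of computing the 99th percentile from a batch of recorded latencies using the nearest-rank method; the class name and sample values are purely illustrative, not from any real system.

    import java.util.Arrays;

    public class Tp99Example {
        // Returns the value at the given percentile (e.g., 99.0) using the
        // nearest-rank method on a copy of the recorded latencies.
        static long percentile(long[] latenciesMs, double pct) {
            long[] sorted = latenciesMs.clone();
            Arrays.sort(sorted);
            int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
            return sorted[Math.max(0, rank - 1)];
        }

        public static void main(String[] args) {
            // Illustrative response times in milliseconds
            long[] samples = {12, 15, 11, 250, 14, 13, 16, 12, 11, 18};
            System.out.println("TP99 = " + percentile(samples, 99.0) + " ms");
        }
    }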

JVM

Every product has different requirements and pressing needs. However, most of the time we focus on the items below, as they cover the core tuning concepts and are the biggest needle movers.

  • Pause Time — The focus is on reducing pause (suspension) time. Pause time impacts response time, aka performance.
  • GC Invocation — Minimizing Full/Major GCs helps the stability of the product.
  • Object Lifetime — Manage object lifetime in the JVM, ensuring objects are not promoted into the major (old) generation and that most of them become garbage in minor collections. The allocation rate also needs to be controlled: the overall object churn rate is how many objects are created and cleaned up. Getting object lifetime right improves overall availability, since it means fewer Full GC invocations and shorter pause times (see the sketch after this list).
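
As a rough illustration of object lifetime (a toy sketch, not code from any real service): objects created and dropped within a request typically die young and are reclaimed cheaply by minor GCs, while objects held in a long-lived structure survive collections, get promoted to the old generation, and eventually drive major GCs.

    import java.util.ArrayList;
    import java.util.List;

    public class ObjectLifetimeDemo {
        // Long-lived structure: anything added here survives minor GCs and is
        // eventually promoted to the old (major) generation.
        private static final List<byte[]> retained = new ArrayList<>();

        public static void main(String[] args) {
            for (int i = 0; i < 100_000; i++) {
                // Short-lived allocation: becomes garbage immediately and is
                // cheap to reclaim in a minor (young-generation) collection.
                byte[] scratch = new byte[1024];

                // Every 1000th object is kept, simulating a growing cache that
                // pushes objects into the old generation over time.
                if (i % 1000 == 0) {
                    retained.add(scratch);
                }
            }
            System.out.println("Retained objects: " + retained.size());
        }
    }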

Operation

We then need to consider how we can get insight into the JVM behavior described above so that we can take corrective action. The approach depends on the specific problems we are facing.

  • JVM Monitoring — Any modern APM tool will help you monitor this. We want to see the pattern of minor and major collections as well as the suspension time. The JVM also exposes these counters programmatically (see the sketch after this list).
  • gc.log analysis — If you configure GC logging, you can view the logs with your favorite log analysis tools.
  • Heap dump — A heap dump comes in two flavors: one with all objects, and one with only live objects, where a Full GC is forced first so that only the objects that survive collection are captured. Another distinction is whether the dump is attached to an API or code path, or taken as a stand-alone dump. Specific APM tools with Performance CI can help; the most common approach is taking the heap dump and analyzing it separately. There are many different ways to capture heap dumps.
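
Besides APM tools and gc.log analysis, the JVM exposes the same counters through its standard management API. Here is a minimal sketch that prints the cumulative collection count and time for each collector; the collector names you see depend on the GC algorithm in use.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcStats {
        public static void main(String[] args) {
            // One MXBean per collector, e.g. "G1 Young Generation" and
            // "G1 Old Generation" when running with G1GC.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }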

Problems

There can be many kinds of issues, with varying intensity. However, we focus on the ones below, as I believe making an impact on these helps across the whole tuning space.

  • GC Pause Time — This is one of the first symptoms of a GC problem, and it is what usually pushes us to look into JVM tuning. Here is a useful reference on how to understand and handle it.
  • Memory Leaks — Coding best practices are an iterative process; even though, as engineers, we focus on writing optimized code, memory leaks are a reality (a minimal example of the classic pattern follows this list).
  • Out of Memory / Full GC — If we do not follow the core principles of what to configure in the JVM, and how, based on the application’s needs, we end up with OOM. Applications are memory and CPU bound, and you need to use both wisely; this has a significant impact on cost.
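
The classic leak pattern is an unbounded, long-lived collection that keeps growing with traffic. The sketch below (class, field, and method names are made up for illustration) shows a cache that is written to but never evicted; under steady load it slowly fills the old generation until frequent Full GCs and eventually an OutOfMemoryError appear.

    import java.util.HashMap;
    import java.util.Map;

    public class LeakyCache {
        // Never evicted: every entry added here stays reachable for the
        // lifetime of the JVM, so the old generation grows with traffic.
        private static final Map<String, byte[]> CACHE = new HashMap<>();

        static byte[] lookup(String requestId) {
            // Bug: keyed by a unique request id, so nothing is ever reused
            // and the map grows without bound.
            return CACHE.computeIfAbsent(requestId, id -> new byte[64 * 1024]);
        }

        public static void main(String[] args) {
            long i = 0;
            while (true) {
                lookup("request-" + i++);  // eventually ends in OutOfMemoryError
            }
        }
    }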

Heap usage graph generated by https://gceasy.io

Background and Future ahead

This section is an attempt to give some background from the time I started working on Java. I try to capture how the technology and trends have changed, share my experience in dealing with them, and keep looking ahead for future changes.

  • Over the last few years, we moved from the Serial, Parallel, and ConcMarkSweep collectors to the G1GC algorithm for JVM tuning, and things have improved further with algorithms like ZGC, which drive pause times down even more.
  • G1GC is by far the most common algorithm and is widely adopted.
  • Java also moved from Sun to Oracle to Corretto, and from JDK 6 to JDK 8 and now JDK 11. A lot of improvement has happened over the years, so picking the latest and greatest is always better, as it brings its own benefits.

Action

Though there are many things we can do to improve JVM tuning, these are the systematic approaches I have been using, quite successfully. If your product is a customer-facing application whose code keeps changing for new features, I believe most of them are applicable.

  • JVM baseline — To start on JVM tuning, get your application to its base state and record it as a baseline. This means you know where your application stands. You have to start somewhere; in my experience that means picking your algorithm (UseG1GC) and core parameters (UseStringDeduplication, Xss). To get more insight into the JVM, you can add logging parameters (Xlog:gc*:file, PrintFlagsFinal). Another school of thought suggests starting from scratch, though that will take longer.
  • JVM Tuning — Once you have a baseline, you can tune with other JVM parameters based on the problem you are addressing; there are more than 600 JVM parameters. Here I assume your tests follow your production usage pattern and focus on the problematic APIs. If you are not focused on reproducing the problem, it is tough to get to the real problem. This is an interesting blog with a few recommendations.
  • JVM Insight — Use tools like jmap and kill to capture a heap dump; a few APM tools also provide heap-dump capability, and a programmatic sketch follows this list. Taking a heap dump also depends on the dump state: live objects only, or not? We should also know what problem we are solving from the dump, minor- or major-generation objects. Are we looking for too many objects in the young generation? For objects that are not garbage and get promoted to the major generation? Or for tenured objects that are not cleaned up even after a Full GC?
  • JVM CI — Performance CI (Continuous Integration) is part of the code pipeline, where we look for performance impact introduced by new code. In a similar way, we can build JVM CI, where the specific JVM impact is measured per API and attached to your CI pipeline. If you implement that as another measure within Performance CI, you catch changes early, following the “Shifting Left” mindset.
  • JVM in Production — Most of the time we hear, “we see this behavior in production, not in pre-production.” Reproducing it requires production-like data as well as tests that are close to real customer load. Finding a memory or JVM problem without knowing the request sequence and usage pattern becomes much more challenging. This is the concept of “Shifting Right”, where you simulate customer-centric behavior up front.
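
In addition to jmap, a HotSpot JVM can be asked to write a heap dump from inside the process. Here is a minimal sketch (the output file name is illustrative) using the HotSpotDiagnosticMXBean; the boolean flag selects between a live-objects-only dump and a dump of everything on the heap, which maps to the “live or not” question above.

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.io.IOException;
    import java.lang.management.ManagementFactory;

    public class HeapDumper {
        public static void dump(String path, boolean liveOnly) throws IOException {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // liveOnly = true forces a collection first and dumps only live
            // objects, similar to jmap -dump:live,...
            bean.dumpHeap(path, liveOnly);
        }

        public static void main(String[] args) throws IOException {
            dump("app-heap.hprof", true);  // illustrative output file name
        }
    }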

Tools

There are various tools and processes around JVM tuning, and I have found a specific pattern that most of the tools support. Although there are tons of tools, and they all bring value, you have to pick the “right tool for doing the right job.”

Capture

  • We can leverage OpenJDK as well as system tools to capture a heap dump. The two best techniques are jmap (jmap -dump:live,file=<file-path> <pid>) or adding a JVM argument (-XX:+HeapDumpOnOutOfMemoryError); there are other techniques as well. Adding the parameter as part of the JVM arguments is the best choice.
  • In the same way we capture memory, we can understand the GC pattern via logs and use many of these tools to analyze it. There is one more challenge here: the gc.log format changes based on the type of JDK and the algorithm, so you have to use a corresponding viewer (GCViewer, GCeasy, etc.).
  • There are also specific APM tools that collect heap dumps (AppD, Dynatrace) and preserve them for further analysis.

Analysis

  • Once you have a heap dump, you can use any of the tools mentioned to analyze it. When you analyze the heap dump as a stand-alone dump, use one of these: YourKit, Eclipse MAT, Heap Hero, etc.
  • Another way is to analyze the heap dump by connecting memory profiling tools and analyzing it alongside your perf test. Not many APM tools have integration with Performance CI, but there are a few (AppD, Dynatrace).

Automate

  • As we know, prevention is better than cure; it would be nice to have solutions or best practices in place so that we can start acting as part of CI and proactively catch GC/memory problems during development, right in the CI pipeline. Here is an interesting solution available to leverage.
  • You must know what pattern you want to see for a JVM problem. Try to focus on the pattern in your test and identify the API that is producing object spikes. If you can attach the test to an API, it is easy to find problems quickly; otherwise, in a monolith, we have to find a “needle in a haystack.” Here is a detailed explanation of how to enable Performance CI, attach tests to APIs, and leverage Performance as a Self-service. A simplified leak-check sketch follows this list.
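
As a flavor of what such a check can look like in a pipeline, here is a deliberately simplified sketch: it measures live heap before and after a workload, requesting a GC around each measurement, and fails the step if retained growth exceeds a budget. The workload, threshold, and method names are all hypothetical; real Performance-CI integrations are far more robust, but the idea is the same.

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;

    public class RetainedHeapCheck {
        // Hypothetical budget: fail if the workload retains more than 32 MB.
        private static final long BUDGET_BYTES = 32L * 1024 * 1024;

        static long liveHeapBytes(MemoryMXBean memory) {
            System.gc();  // best-effort hint so mostly live objects are counted
            return memory.getHeapMemoryUsage().getUsed();
        }

        public static void main(String[] args) {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            long before = liveHeapBytes(memory);

            runWorkload();  // placeholder for the real test traffic

            long growth = liveHeapBytes(memory) - before;
            System.out.printf("Retained growth: %d bytes%n", growth);
            if (growth > BUDGET_BYTES) {
                throw new IllegalStateException("Possible leak: retained growth over budget");
            }
        }

        private static void runWorkload() {
            // Hypothetical stand-in for replaying representative API traffic.
            for (int i = 0; i < 10_000; i++) {
                String s = ("payload-" + i).repeat(10);
                if (s.isEmpty()) System.out.println(s);  // prevent trivial dead-code removal
            }
        }
    }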

Cost

Finally, I will share my thoughts on cloud cost reduction and optimizing resources in the cloud, especially on Kubernetes. I am focusing only on AWS and Kubernetes here, but the same is true for any cloud platform, and I am using generic terminology. Just as we detailed out the JVM, we have a similar breakdown for memory and CPU:

  • Tier Type — On a cloud platform you have the option to pick your subscription model. This is a core decision about which mode to go for, based on your usage pattern, and it is usually picked based on your budget.
  • Instance type — This is the resource type on which we host our application. For AWS, here are the options for the various machine types.
  • Node / Pod — How much memory and CPU you provide to your application is a key factor in configuring the node and pod. The node is the higher-level unit that hosts the pods, which in turn contain the application as containers.
  • Application — How much memory you give to your application, typically mapped to -Xmx and -Xms. This is the lowest unit, and it determines the overall cloud infra cost, so if you tune your application, every layer above is impacted. Today this is the most critical thing that can be optimized through this “Art.”

Reward

Using these best practices, we have improved and optimized monoliths and microservices hosted in the cloud and in containers.

  • Stability — We got the biggest benefit on a flagship monolith by handling production restarts. Through JVM tuning, we were able to reduce the memory pressure in production that used to force a restart every night. We reduced Major GCs by 95% and pause time by 85%, and eliminated the daily restart on one of the production instances; we have hundreds of such instances for this product in the cloud.
  • Scale — When we started building our next-gen platform on Kubernetes to support thousands of services on hundreds of clusters using Argo GitOps, we needed to build a containerized microservice app that is as close to our production as possible. Using the above best practices, we can scale our reference service from 10 TPS to 10,000 TPS after improving transaction handling at the individual service level, which is well within our SLA of 50ms response time with a 0.025% error rate.

Key Takeaways

  • Define your customer SLAs (error rate, response time, and availability)
  • Define your application goals for the customer SLAs
  • Measure JVM insights with respect to the application goals
  • Find the opportunities for JVM tuning from those insights
  • Implement the best practices for JVM tuning
  • Make the JVM changes, measure, and optimize further

Conclusion

Before starting on JVM tuning, you should know what you are trying to optimize. Then find where to focus: is it a production incident, a new code change, or a brand-new product? Is it a leak, pause time, or Full GC occurrences that you are focusing on? Look into JVM monitoring and identify the problems. Make sure you know whether it is a scale problem or a performance problem. Then focus on how to reproduce it using tests, customer use cases, or long-running executions. Capture details on the JVM so that they help you identify what to focus on; if it is a leak, look for recent code changes. Implement the best practices, and consider better GC algorithms with the latest Java version. Make the changes, measure, and continue to improve. Don’t try to do too much JVM tuning, as you have to think about ROI and your goal. I have also found companies building solutions for JVM automation and optimization. Last but not least, read through the references and best practices, as there are so many tools and practices available.

References

Thanks to Ed Lee, Mark Basler, Ram Lakshmanan and Andreas Grabner for reviewing and providing valuable feedback.

Sumit Nagal, Principal Engineer, Intuit Inc.
