Erick Willis October 5, 2023
Containerized services are essential to serverless technologies. Containers make apps portable and consistent across development, testing, and production environments. Lightweight and easy to deploy, they're key to increasing security without adding resources, and to scaling up or down based on demand.
At Nuvalence, we’ve seen containers drive innovation for all kinds of clients. Discrete manufacturers, telecoms, financial services – any business that has to handle increased traffic without downtime. Wherever the fast pace of production demands automation of real-time data streams and orchestration of interactions, containers bring agility to the event-driven architecture. In key moments, they’re top performers. But there are areas where container performance can run into issues. It’s critical to identify and resolve these issues before they turn into reliability concerns and impact your system’s long-term scalability.
We recently helped a client address reliability issues they had been experiencing with Java in their containerized environment. Load times that typically took a few seconds were taking ten times longer, creating a poor user experience. By tuning Google Cloud Run, we were able to fix these performance issues and restore users' trust in the system. Cloud Run is a great platform for standardizing services and maintaining a high level of security without having to manage servers or compute infrastructure – and without compromising on performance. In using it to diagnose and resolve these performance concerns, we gained some valuable insights. In this blog post, let's look at how we used Google Cloud Run to understand container performance issues, ensure high availability, and enable this organization to confidently scale.
The Issue With Startup Times
If your business is fairly new to serverless technologies, and you are more accustomed to running services on-prem or in a static compute environment like virtual machines, Cloud Run can provide a relatively seamless pathway to serverless. Because services are easy to deploy and start, you get the benefits of autoscaling based on load demand without much extra work for your engineering teams. It's a container platform that lets you package, build, and deploy your services in much the same way you're used to, using many of the same technologies, such as Java and Spring Boot.
Java may be one of the most popular programming languages and a linchpin of real-time interactions for many businesses, but it is not exactly well known for its performance characteristics. Over the years, it has seen various just-in-time compilation improvements that deliver faster runtime performance, but at the expense of startup time or, at minimum, with very small pauses during execution while a 'hot spot' is being optimized. This was sufficient for years – decades, even – but in the age of serverless computing, it's a significant hindrance, as it results in 'cold start' latency.
Cold start latency is the pause that an application's consumer experiences while waiting for a new node to spin up and become available to serve their request, such as an API request. Cold starts also occur when an application has to scale up its instance count to handle incoming load and process requests efficiently. Depending on the environment and startup time, the delay can range from a second or two, to minutes, or, theoretically, far longer. When users expect responses within a few seconds, even a delay of several seconds more is not good for enterprise applications, as it can rapidly create a pervasive perception of poor performance among an app's user base.
In working with our client, we observed that some of their Spring Boot applications, combined with largely default Cloud Run configurations, experienced startup times approaching 40-50 seconds. We also noticed other strange behavior: at seemingly random times, latencies would momentarily spike and several new instances of our services would spin up, all without a significant change in traffic load. Generally, these new instances numbered somewhere between 10-15, but in some extreme cases we observed more than 50 new instances spin up, only to completely spin down a minute or two later. This happened periodically throughout the day and didn't seem to be related to traffic load. For example, the largest instance spike we ever saw occurred at 7 AM on a Sunday, when the only traffic on the system was our standard health checks.
Diving into Cold Start Mitigation
To better understand cold starts in Cloud Run, we had to first deep dive into how client requests are processed in the platform. In effect, long cold start times have some specific ramifications for container scalability and performance in Cloud Run, including:
- New client requests beyond the capability of existing instances to serve are put into a queue, waiting on a new instance to be ready.
- When this queue grows, Cloud Run interprets this as additional demand for the service, and will create new instances to handle the new load – even if all of the requests in the queue could be satisfied by a single new node.
Additionally, there are two known issues to be aware of:
- Cloud Run instances may be terminated at any time, and may cause you to temporarily drop below your configured Minimum Instance count (including down to 0).
- Requests that sit in the queue for more than 10 seconds may be terminated, returning a 429 (Too Many Requests) response to the client.
Because of these limitations, if your business is working with Java in Cloud Run, it’s critical to optimize the startup time of Java applications and be proactive in order to mitigate any cold start issues.
There are some configurations that may help you overcome these obstacles. Keep in mind that some of these will impact the cost that your business will incur by using Cloud Run. Let’s take a closer look.
Be Mindful When Maximizing CPU and Memory
While maximizing both CPU and Memory may help, in our experience, raising your available CPU has a far greater impact on app startup time than raising Memory. Since the results may vary depending on your workload, we recommend running some experiments and seeing what works best for your applications.
In addition to this, we advise leveraging the CPU Boost option. This feature is specifically designed to reduce cold start latency while minimizing the long-term costs. It effectively doubles the CPU available to your service only during startup, allowing it to drop back down to the configured level once your service is available.
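These settings can be applied through the console or the gcloud CLI. As a rough sketch (the service name, region, and resource values below are placeholders, not our client's actual configuration):

```shell
# Raise CPU and memory, and enable startup CPU boost, for an existing service.
# "my-service" and the values shown are illustrative placeholders.
gcloud run services update my-service \
  --region=us-central1 \
  --cpu=2 \
  --memory=1Gi \
  --cpu-boost
```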
Don’t Let Old GC Algorithms Collect Dust
Garbage Collection (GC), configured naively, can negatively impact performance, incurring occasional pauses while unused memory is freed. Modern processing speeds have mitigated much of this problem, and with Java 17, the Z Garbage Collector (ZGC) promises pause times of only a few milliseconds.
As GC should not significantly impact startup time, we knew it wouldn't solve our application's cold start problem. However, the runtime improvements it delivers are worthwhile – especially given how simple it is to enable the latest GC algorithms. Just add the following to your Java command line:
`-XX:+UseZGC`
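In Cloud Run, where you may not want to rebuild the container image just to change JVM flags, one option (a sketch; the service name is a placeholder) is the standard `JAVA_TOOL_OPTIONS` environment variable, which the JVM picks up automatically at launch:

```shell
# The JVM reads JAVA_TOOL_OPTIONS at startup, so the flag takes effect
# without modifying the container image. "my-service" is a placeholder.
gcloud run services update my-service \
  --set-env-vars=JAVA_TOOL_OPTIONS=-XX:+UseZGC
```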
Probing for Answers: Startup and Liveness
Prior to getting a full understanding of the known issues above, startup and liveness probes looked like they could be the solution we were looking for. It turned out they were not – however, they did lead us to the true problems we were experiencing.
Startup probes are straightforward; you expose an endpoint from your app that will only return a 200 OK response when the app is ready to serve requests, and you point Cloud Run to that URL. During the startup process, Cloud Run will periodically hit this URL, and will only make your instance available after it returns a 200 OK response.
This can be important for cases where your instance is able to accept connections, but not yet service them – in other words, where your embedded Servlet container is ready to go, but Spring, or whatever framework you’re using, still has more work to do to be fully initialized.
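As a minimal sketch of such an endpoint (note that Spring Boot Actuator's built-in `/actuator/health/readiness` endpoint provides this out of the box; the `/startup` path and class name here are our own illustration):

```java
import org.springframework.boot.availability.ApplicationAvailability;
import org.springframework.boot.availability.ReadinessState;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Returns 200 OK only once Spring reports the app can accept traffic;
// point Cloud Run's startup probe at this path.
@RestController
public class StartupProbeController {

    private final ApplicationAvailability availability;

    public StartupProbeController(ApplicationAvailability availability) {
        this.availability = availability;
    }

    @GetMapping("/startup")
    public ResponseEntity<String> startup() {
        boolean ready =
            availability.getReadinessState() == ReadinessState.ACCEPTING_TRAFFIC;
        return ready
            ? ResponseEntity.ok("READY")
            : ResponseEntity.status(503).body("STARTING");
    }
}
```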
Our initial theory was that our systems were scaling naturally due to some load increase, and Cloud Run was simply allocating inbound requests to our instances before they were ready. If that were the case, startup probes might have addressed it, as long as the original instances had enough capacity to continue serving requests for a few seconds longer (which our metrics suggested should be the case). This would still leave us with long cold start times, but would largely eliminate the impact on the end user.
Liveness probes are similar, but are intended to inform Cloud Run that your service is no longer able to serve requests, and will serve as a trigger to replace that instance. They won’t help with your cold start behavior, but they can help the overall performance and responsiveness of your service.
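Both probes are defined on the Cloud Run service itself. Here's a sketch of what this can look like in the service YAML (applied with `gcloud run services replace`; the names, image, and Actuator paths are assumptions based on a typical Spring Boot setup):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-service                            # placeholder
spec:
  template:
    spec:
      containers:
        - image: gcr.io/my-project/my-service # placeholder
          startupProbe:
            httpGet:
              path: /actuator/health/readiness # assumes Spring Boot Actuator
              port: 8080
            periodSeconds: 2
            failureThreshold: 30               # tolerate up to ~60s of JVM startup
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            periodSeconds: 10
```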
However, after a thorough review of Cloud Run's monitoring and logging tools, we determined that this was not our issue. Instead, we discovered unexpected changes in our instance count around the time the issues manifested, and we were able to attribute these changes to our first known issue: Cloud Run was shutting down the only instance we had running – effectively dropping our instance capacity to zero and forcing any inbound requests to queue up and wait for a new instance to become ready. We came to fully understand what was happening as we dug deeper into these issues in collaboration with the Google Cloud team.
You can find more info on Cloud Run health checks at: https://cloud.google.com/run/docs/configuring/healthchecks.
Minimum Instances: The Magic Number
If instant responsiveness is important, consider configuring at least 3 instances to be available at all times. We recommend this number because, while Cloud Run instances may be terminated at any time (the scalability and performance issue noted above), it's unlikely that all three will be terminated at once, leaving the remaining instances to handle traffic while a replacement is instantiated.
Though more than 3 instances might offer additional safeguards, we advise 3, keeping in mind that configuring minimum instances in Cloud Run means you’ll be billed for the constant allocation of those infrastructure resources.
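Configuring this is a one-line change (the service name is again a placeholder):

```shell
# Keep at least 3 warm instances at all times. Note: minimum instances
# are billed for their allocated resources even when idle.
gcloud run services update my-service \
  --min-instances=3
```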
Experimentation with Native Images
GraalVM’s native image feature offers another potential solution to reduce startup times. GraalVM compiles your application Ahead-of-Time (AOT) into a standalone executable, eliminating the Java virtual machine’s startup overhead. Often, this will significantly improve cold start performance and minimize memory footprint. However, creating a native image can be complex and time-consuming, and not all Java features are fully supported. For example, Google’s Java client libraries only became widely compatible with native image compilation as of May 2022.
Additionally, while native images start faster and use less memory, their peak performance may be less than JVM-based applications. This is because the JVM’s Just-In-Time (JIT) compiler optimizes code based on actual usage patterns at runtime, leading to highly optimized code for frequently used paths. GraalVM native images perform all their optimization at build time, which can lead to less optimal code if the usage patterns at runtime differ from what was anticipated at build time. However, in a serverless environment where instances are not long-lived, the impact of this difference may be less significant.
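For context, here is roughly what the build looks like for a Spring Boot 3 project using the GraalVM Native Build Tools Maven integration (a sketch that assumes a GraalVM JDK is installed; the binary name depends on your artifactId):

```shell
# Compile the application ahead-of-time into a standalone native binary
# using Spring Boot 3's built-in "native" Maven profile.
mvn -Pnative native:compile

# Run the resulting executable directly; no JVM startup or JIT warmup.
./target/my-app   # "my-app" is a placeholder for your artifactId
```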
In our case, preliminary benchmarking of a simple Spring Boot application showed significant improvements in cold start times. When compiled and executed normally, our test case showed a consistent startup time of 5-6 seconds, with a file size of approximately 65MB, whereas the same application compiled to a native image using GraalVM started in 100-150 milliseconds on average – although the binary itself came out to around 0.5GB.
While GraalVM may not be a silver bullet, its promise to dramatically boost startup times is quite compelling. If you can accept the larger build size and longer build times, then GraalVM can be a powerful tool for enhancing the performance and responsiveness of your Java applications in serverless environments.
Maintaining Container Startup Performance
Ultimately, no two situations call for exactly the same solution to a given performance problem, and what is described above may not match your experience. Get to know the Cloud Run dashboards so you can effectively monitor your system. Investigate any unusual spikes in Container Instance Count and Billable Container Instance Time – especially when there's no correlated spike in Max Concurrent Requests.
If you see that happening, you might be experiencing issues similar to those in the use case I've described above. It's tempting to assume that spikes in Container CPU Utilization and Container Memory Utilization are due to an increase in traffic, but that assumption can obscure your investigation. In our case, because CPU and Memory utilization are naturally high while new instances are starting, these spikes turned out to be an effect of the issues we were experiencing, not the cause.
And perhaps most importantly, try to keep your Container Startup Latency as low as possible, even if it means assigning more resources to your services, and measure the actual costs this accrues. If it results in a significant enough drop in processing time, you may find it offsets the cost increase.
By experimenting with your findings and following a methodical approach, you'll fine-tune your own best practices for maintaining high performance across your system. Then you can take full advantage of the scalability and security that containerization offers while maximizing availability and ensuring a reliable, trustworthy user experience.
New to serverless? Interested in learning more about specific technical issues in serverless environments? Check out some of our previous insights: Modeling & Analyzing Lambda vs. Fargate Breakdown, and Code Reuse and Serverless Apps.