Alexander Brand June 9, 2022
The gRPC Remote Procedure Call (RPC) framework is an essential tool in a platform engineer’s toolbox. In most cases, gRPC is used in microservice architectures to enable internal communication between microservices. That said, gRPC can also make its way to the edge of an architecture to expose end-user-facing APIs. Recently, we helped a client investigate a puzzling issue affecting one of their most critical APIs. This post walks through the problem, the solution, and the lessons we learned while working through it.
The problem began when one of the primary API consumers started experiencing a high number of RPC exceptions intermittently throughout the day. Initially, they shrugged off the issue as a one-off, but it kept recurring daily. After analyzing the historical client-side metrics, we saw that the spikes in exceptions appeared to coincide with periods when the client was sending a large number of requests.
On the server side, the metrics (e.g., error rates) did not indicate the gRPC service having issues handling requests. The service was not throwing exceptions, and the logs did not include anything suspicious. From the server’s perspective, everything was operating nominally.
We continued our investigation by evaluating the entire request path. The gRPC service runs in a Kubernetes cluster and is exposed to external consumers through an Ingress resource, with the NGINX Ingress Controller acting as the ingress controller. We inspected the ingress controller’s metrics, and there was no indication of elevated error rates in the ingress layer either. At this point, we were puzzled and wondered whether the requests were somehow getting lost.
To better understand the issue, we asked the API consumer to share more details about the errors they were experiencing. Specifically, we wanted to know the exception’s status and message. In addition, we were curious about the distribution of the status code across the thousands of exceptions they were getting daily. As it turned out, the vast majority of the exceptions were status=Unavailable exceptions with a “GOAWAY received” message.
GOAWAYs are an HTTP/2 construct, and gRPC uses HTTP/2 as its transport protocol. In HTTP/2, a GOAWAY frame informs the client that the server is shutting down the connection and that the client should stop sending new requests on it. In other words, GOAWAY frames enable the graceful shutdown of an HTTP/2 server. Knowing this, we set out to determine whether there was any correlation between the time Kubernetes Pods shut down (due to updates, failure, etc.) and the time the client experienced the exceptions. Unfortunately, we found no correlation. Pod restarts were not causing the problem.
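For readers who have not looked at a GOAWAY frame before, the sketch below decodes its payload as laid out in the HTTP/2 specification: a reserved bit plus a 31-bit last stream ID, a 32-bit error code, and optional debug data. The function name and the sample payload are ours, purely for illustration. The last stream ID is the interesting field here, because streams with a higher ID were not processed by the sender.

```python
import struct

GOAWAY_FRAME_TYPE = 0x7  # frame type code for GOAWAY in the HTTP/2 spec


def parse_goaway_payload(payload: bytes):
    """Split a GOAWAY payload into (last_stream_id, error_code, debug_data)."""
    if len(payload) < 8:
        raise ValueError("a GOAWAY payload is at least 8 bytes long")
    raw_stream_id, error_code = struct.unpack(">II", payload[:8])
    last_stream_id = raw_stream_id & 0x7FFFFFFF  # mask off the reserved top bit
    return last_stream_id, error_code, payload[8:]


# Hypothetical payload: last processed stream 199, error code 0 (NO_ERROR).
example = struct.pack(">II", 199, 0)
print(parse_goaway_payload(example))  # prints (199, 0, b'')
```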
Reproducing the Issue With a Test Client
As we continued digging into the problem, we decided to attempt to reproduce the problem using a Java-based CLI that sent gRPC requests to our service. We pointed the program to the staging environment and ran multiple tests but could not reproduce the issue.
Because our API consumer uses a deprecated C# gRPC client, we built an additional test program in C#. The C# gRPC client and the Java client are completely disjoint implementations, as opposed to some other clients—such as Python, Ruby, etc.—that share a core C++ foundation. Thus, we built a C# client to fully replicate our consumer’s client stack and rule out any client-side bugs. We reran our tests against our staging environment using the C# client, and this was when things got interesting.
Based on what we knew about how our user was consuming our gRPC API, the test program did the following:
1. Instantiate a gRPC channel and client.
2. Create a configurable number of C# Tasks (futures, in other programming languages) that use the client to send asynchronous RPCs to the server.
3. Wait for the completion of all asynchronous calls and count the number of Tasks that raised an exception.
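The steps above can be sketched roughly as follows. This is not the C# program we ran; it is a minimal Python analogue in which `fake_rpc`, `Unavailable`, and `run_test` are hypothetical names, and the RPC is a stand-in that fails on a fixed pattern so that the sketch runs without a server.

```python
import asyncio


class Unavailable(Exception):
    """Stand-in for an RPC failure with status Unavailable."""


async def fake_rpc(request_id: int) -> str:
    # Hypothetical stand-in for a real stub call: every 4th request fails.
    await asyncio.sleep(0)
    if request_id % 4 == 0:
        raise Unavailable(f"request {request_id}: GOAWAY received")
    return f"ok {request_id}"


async def run_test(num_requests: int, max_concurrency: int) -> int:
    """Send num_requests RPCs concurrently and return how many failed."""
    # The semaphore plays the role of the thread pool size knob.
    limiter = asyncio.Semaphore(max_concurrency)

    async def one_call(request_id: int) -> str:
        async with limiter:
            return await fake_rpc(request_id)

    # A real client would create its channel and stub here (first step).
    tasks = [asyncio.create_task(one_call(i)) for i in range(num_requests)]
    # Wait for every call and keep exceptions as results instead of raising.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return sum(isinstance(result, Exception) for result in results)


failed = asyncio.run(run_test(num_requests=20, max_concurrency=8))
print(f"{failed} of 20 requests failed")  # prints "5 of 20 requests failed"
```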
We ran the test program multiple times with different configurations. The primary knobs we turned were the number of requests sent to the server and the thread pool size that executed the Tasks. As we ran the tests, a magic number emerged: 100. Whenever the number of threads exceeded 100, we observed Unavailable exceptions (caused by a GOAWAY) in the test program. Furthermore, the number of exceptions seemed to approximately equal the number of requests minus 100. In other words, requests other than the first 100 failed with an Unavailable exception. Where was this magic number coming from?
NGINX Ingress Controller and 100 as a Magic Number
Given that the test program was receiving a GOAWAY frame, we knew that the sender had to be a system that spoke HTTP/2. This fact ruled out the TCP load balancer fronting the Kubernetes cluster and left us with two suspects: NGINX and the gRPC service itself.
From the test program’s perspective, NGINX is the gRPC service, so we focused on NGINX first. We inspected the explicit NGINX configuration looking for directives set to 100, but nothing jumped out. We then reviewed NGINX’s HTTP/2 module documentation, searching for implicit config (i.e., defaults) that could be causing the issue. One of the configuration directives caught our attention:
http2_max_requests: Sets the maximum number of requests (including push requests) that can be served through one HTTP/2 connection, after which the next client request will lead to connection closing and the need of establishing a new connection.
While the description of the directive could explain the behavior we were seeing, the fact that the default value for this directive was 1000 instead of 100 left us scratching our heads. However, we noticed that the directive has been obsolete since NGINX 1.19.7, and we were using version 1.19.9. In 1.19.9, the http2_max_requests directive is superseded by keepalive_requests, which defaults to 100 in that version (versions 1.19.10 and later default to 1000). Bingo.
At this point, we had an explanation for our magic number. The keepalive_requests directive configured NGINX to send a GOAWAY frame on a connection once it had handled 100 requests. Based on this finding, we implemented retries with exponential backoff in our gRPC test program (using Polly).
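To illustrate the idea, here is a minimal Python sketch of retries with exponential backoff. It is not the Polly policy we used in the C# test program. The names `call_with_retries`, `Unavailable`, and `flaky_call`, as well as the delay values, are assumptions chosen for the example.

```python
import random
import time


class Unavailable(Exception):
    """Stand-in for an RPC failure with status Unavailable."""


def call_with_retries(operation, max_attempts=5, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Call operation(); on Unavailable, back off and retry.

    max_attempts counts the first call as well as the retries.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Unavailable:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
            # The delay doubles with each attempt, up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter keeps many callers from retrying at the same instant.
            sleep(delay * random.uniform(0.5, 1.0))


# Usage: a hypothetical call that fails twice and then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise Unavailable("GOAWAY received")
    return "response"


print(call_with_retries(flaky_call, sleep=lambda _: None))  # prints "response"
```

The `sleep` parameter exists only so the example (and any test) can skip the real waiting time.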
Retries helped but did not eliminate the GOAWAY-related exceptions, especially during high-load situations where the client would quickly exhaust retry attempts. We took a step back and asked ourselves: How come we could not reproduce this problem using a Java-based client? We also wondered: Wouldn’t it be nice if the gRPC client would handle this automatically for us?
gRPC Transparent Retries
After spelunking through the gRPC code base, we found the gRPC Retry Design proposal. This document outlines the built-in retry strategies available in the gRPC client. Transparent retries, one of the strategies outlined in the document, were implemented only in the Java and Go clients at the time, which explains why we could not reproduce the issue with a Java client.
The transparent retries feature retries failed RPC requests if a) the request never left the client (i.e., the request remains in the client queue) or b) the request reached the gRPC server but was not seen by the server application logic. In both cases, the server business logic has not processed the request, making it possible for the gRPC client to transparently retry the request regardless of whether the request is idempotent.
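The rule described above can be written down as a small decision function. This is only a sketch of the eligibility check, not how any gRPC client implements it, and the enum and function names are illustrative.

```python
from enum import Enum, auto


class RpcProgress(Enum):
    """How far a failed RPC got before it failed (illustrative names)."""
    NEVER_LEFT_CLIENT = auto()       # case a: still queued in the client
    REACHED_SERVER_LIBRARY = auto()  # case b: arrived, application never saw it
    SEEN_BY_APPLICATION = auto()     # server business logic may have run


def can_retry_transparently(progress: RpcProgress) -> bool:
    """True when the server business logic cannot have processed the RPC."""
    return progress in (
        RpcProgress.NEVER_LEFT_CLIENT,
        RpcProgress.REACHED_SERVER_LIBRARY,
    )


for progress in RpcProgress:
    print(progress.name, can_retry_transparently(progress))
```

Once the application logic may have seen the request, retrying is only safe if the request is idempotent, so that case is left to an explicit retry policy.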
In our case, we were likely experiencing a combination of both of the situations described above. However, because the C-core client—which underpins the core C# client—did not support transparent retries at the time, we had to implement retries ourselves. The good news is that the C-core client supports transparent retries starting with version 1.45.0. If you have gRPC clients that leverage the C-core library (C++, Python, Ruby, Objective-C, PHP, C#), we suggest you upgrade to v1.45.0 or greater to take advantage of transparent retries.
Resolution and Lessons Learned
To recap, the problem was that one of our most critical API consumers was experiencing GOAWAY-related exceptions intermittently, which were affecting their Service Level Objectives (SLOs). In the initial phases of our investigation, we were puzzled as the server-side logs and metrics did not indicate a problem handling requests. However, as we dug deeper into the problem, we noticed that an NGINX configuration was exacerbating the issue.
The NGINX keepalive_requests configuration directive tells NGINX to gracefully close a connection (by sending a GOAWAY frame) once it has handled a certain number of requests. In our case, that number was 100. Thus, our client would receive a GOAWAY frame for every 100 requests it sent.
Given that the gRPC client our consumer was using did not implement transparent retries, all requests that had left the application but were still queued in the gRPC client failed when the client processed the GOAWAY frame. This behavior is the basis for one of our most critical findings: a single GOAWAY frame sent by our service was amplified into hundreds of RPCExceptions on the client side.
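A deliberately simplified model makes the amplification easier to see. It assumes that each burst of requests is queued on a single connection, that the connection serves requests until it reaches its limit, and that nothing retries the requests left behind. The function name and the burst sizes are hypothetical, and this is not a model of the consumer's real traffic.

```python
def simulate_bursts(bursts, keepalive_requests=100):
    """Return (goaways, failures) for bursts sent over one connection at a time."""
    goaways = 0
    failures = 0
    budget = keepalive_requests  # requests the current connection may still serve
    for burst in bursts:
        served = min(burst, budget)
        budget -= served
        if budget == 0:
            # The connection reached its limit: one GOAWAY is sent, and every
            # request still queued behind it fails because nothing retries it.
            goaways += 1
            failures += burst - served
            budget = keepalive_requests  # the next burst starts on a new connection
    return goaways, failures


# Four hypothetical bursts of 300 requests each, with a limit of 100.
print(simulate_bursts([300] * 4))  # prints (4, 800)
```

In this toy run, four GOAWAY frames turn into 800 failed requests, which matches the pattern from our tests where the number of failures was roughly the number of requests minus 100.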
In the end, we addressed this problem on both the server and the client. On the server side, we increased NGINX’s keepalive_requests parameter to 1000, reducing the number of GOAWAYs our Ingress layer sends by a factor of 10. On the client side, we collaborated with our API consumer to implement retries in their client application. Additionally, we recommended they create multiple gRPC channels (i.e., connections) and load balance requests across them. Finally, after the release of transparent retries in the C-core client, we reached back out to them with a recommendation to upgrade their gRPC libraries.
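The multiple-channel recommendation can be sketched as a small round-robin pool. This is a Python illustration rather than the consumer's C# code, and `ChannelPool` and the channel factory are hypothetical. In a real client the factory would create a gRPC channel. Some gRPC implementations may reuse one underlying connection for channels created with identical settings, so it is worth verifying that each channel really gets its own connection.

```python
import itertools
import threading


class ChannelPool:
    """Hands out channels from a fixed-size pool in round-robin order."""

    def __init__(self, create_channel, size):
        self._channels = [create_channel() for _ in range(size)]
        self._cycle = itertools.cycle(self._channels)
        self._lock = threading.Lock()  # guards the shared iterator across threads

    def next_channel(self):
        with self._lock:
            return next(self._cycle)


# Usage with a stand-in factory that returns labels instead of real channels.
ids = itertools.count()
pool = ChannelPool(lambda: f"channel-{next(ids)}", size=3)
print([pool.next_channel() for _ in range(4)])
# prints ['channel-0', 'channel-1', 'channel-2', 'channel-0']
```

Spreading requests this way means a GOAWAY on one connection only affects the requests queued on that connection, not the whole burst.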
If you are leveraging gRPC in your technology stack, and especially if you are exposing gRPC API endpoints to end-users, we hope that the lessons we learned while working through this issue are valuable to you. Also, if you’re passionate about working with distributed systems and digging into issues like these, we are looking for you! Check out our careers page for our openings.