We have a Java Thrift service that makes RPC calls to a downstream dependency. The calls are made asynchronously using Guava's Futures.addCallback:

```java
ListenableFuture<Response> future = dependency.callAsync(request);
Futures.addCallback(future, callback, callbackExecutor);
```
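For context, the full pattern looks roughly like the sketch below. This is a hypothetical illustration, not our actual code: `Dependency`, `Request`, `Response`, and `resultHandler` are placeholder names.

```java
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import java.util.concurrent.Executor;

// Hypothetical sketch of the handler pattern; all names are placeholders.
void handle(Request request, Dependency dependency, Executor callbackExecutor) {
    ListenableFuture<Response> future = dependency.callAsync(request);
    Futures.addCallback(future, new FutureCallback<Response>() {
        @Override
        public void onSuccess(Response response) {
            // Hand the downstream result back to the Thrift request (placeholder).
            resultHandler.onComplete(response);
        }

        @Override
        public void onFailure(Throwable t) {
            // Surface the downstream failure to the caller (placeholder).
            resultHandler.onError(t);
        }
    }, callbackExecutor);
    // The request thread returns here; only callbackExecutor threads run the callback.
}
```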

The incident

- The downstream dependency became slow (latency spiked to ~10+ minutes).
- Shortly after, our service completely froze: it stopped accepting any incoming requests.
- The thrift.active_requests metric dropped to 0, even though clients were actively sending requests.
- We observed high GC CPU usage during the freeze.
- All machines of our service froze at roughly the same time.
- After restarting the service, everything returned to normal.

What I don't understand: since we're using async futures with callbacks, request threads should be freed immediately after registering the callback; they shouldn't be blocked waiting on the slow dependency. So why did the service freeze? And why was active_requests = 0? Shouldn't requests at least be entering the handler?

My theories

GC pressure from accumulated futures? With a 10-minute timeout and a high request rate, maybe hundreds of thousands of pending futures accumulated in memory, causing GC thrashing that froze all threads, including the acceptor thread?
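As a rough sanity check on this theory (the request rate here is a purely assumed number, not from our metrics): at, say, 1,000 requests/second with a 10-minute (600 s) timeout, up to 1,000 × 600 = 600,000 requests could be in flight at once, each pinning its request, response buffers, future, and callback in memory until the timeout fires.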

Some non-async, blocking outgoing call? I did an initial pass over the codebase and didn't find any such calls, but I could try again.
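For reference, the kind of thing I'd be grepping for is a hidden synchronous wait on the async call, something like the hypothetical snippet below, where the request thread still blocks despite the async API:

```java
import com.google.common.util.concurrent.ListenableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical anti-pattern to search for: the call is "async",
// but the request thread blocks on the result anyway.
Response callDependency(Request request, Dependency dependency) throws Exception {
    ListenableFuture<Response> future = dependency.callAsync(request);
    // A blocking get() ties up the Thrift worker thread for up to the full
    // downstream latency (10+ minutes during the incident).
    return future.get(10, TimeUnit.MINUTES);
}
```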

I'm looking for a plausible explanation of what could have happened. I know this is a very open-ended question, but I'd really appreciate some pointers in the right direction. I'm really stressed out :(
