We have a Java Thrift service that makes RPC calls to a downstream dependency. The calls are made asynchronously using Guava's `Futures.addCallback`:
```java
ListenableFuture<Response> future = dependency.callAsync(request);
Futures.addCallback(future, callback, callbackExecutor);
```

The incident
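For context, the point of this pattern is that registering the callback is non-blocking. A minimal JDK-only sketch of the same shape, using `CompletableFuture` as a stand-in for Guava's `ListenableFuture` (the `callAsync`/`handle` names are hypothetical):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

public class AsyncPattern {
    // Stand-in for the downstream dependency: returns immediately with an
    // incomplete future, as a non-blocking RPC client would.
    static CompletableFuture<String> callAsync(String request) {
        return new CompletableFuture<>();
    }

    // Registers a callback and returns at once; the request thread is freed
    // long before the dependency responds.
    static CompletableFuture<String> handle(String request, Executor callbackExecutor) {
        CompletableFuture<String> future = callAsync(request);
        future.whenCompleteAsync((response, error) -> {
            // response handling runs later, on callbackExecutor
        }, callbackExecutor);
        return future;
    }
}
```

If this invariant holds everywhere, the slow dependency should only delay callbacks, not request acceptance, which is what makes the freeze surprising.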
- The downstream dependency became slow (latency spiked to ~10+ minutes)
- Shortly after, our service completely froze: it stopped accepting any incoming requests
- The `thrift.active_requests` metric dropped to 0, even though clients were actively sending requests
- We observed high GC CPU usage during the freeze
- All machines of our service froze at roughly the same time
- After restarting the service, everything returned to normal

What I don't understand

Since we're using async futures with callbacks, request threads should be freed immediately after registering the callback; they shouldn't be blocked waiting on the slow dependency. So why did the service freeze? And why was `active_requests` = 0? Shouldn't requests at least be entering the handler?
My theories
GC pressure from accumulated futures? With a 10min timeout and high request rate, maybe hundreds of thousands of pending futures accumulated in memory, causing GC thrashing that froze all threads including the acceptor?
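The scale of this theory can be sanity-checked with back-of-envelope arithmetic. A sketch, where the request rate and per-future footprint are made-up illustrative numbers, not measurements from our system:

```java
public class PendingFutures {
    // Futures outstanding at steady state = arrival rate x time-to-timeout.
    static long pendingFutures(long requestsPerSecond, long timeoutSeconds) {
        return requestsPerSecond * timeoutSeconds;
    }

    // Heap retained by those futures (request, response buffers, callback
    // closures), assuming some average per-future footprint.
    static long retainedBytes(long pending, long bytesPerFuture) {
        return pending * bytesPerFuture;
    }

    public static void main(String[] args) {
        long pending = pendingFutures(1_000, 600);   // 1k req/s, 10 min timeout -> 600,000 futures
        long bytes = retainedBytes(pending, 10_000); // assume ~10 KB each -> ~6 GB retained
        System.out.println(pending + " pending futures, ~" + bytes / 1_000_000_000.0 + " GB retained");
    }
}
```

Even with modest per-future footprints, a 10-minute timeout multiplies a healthy request rate into heap pressure that could plausibly drive the GC thrashing observed.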
Some non-async outgoing blocking call? I did an initial pass over the codebase and didn't find any such calls, but I could try again.
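One way such a call can hide in an apparently async path: the client returns a future, but acquiring a connection from a bounded pool happens synchronously before the future is even created. A hypothetical sketch (the pool-as-semaphore model and all names are assumptions, not our client's actual internals):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

public class HiddenBlocking {
    // Bounded connection pool modeled as a semaphore.
    private final Semaphore connections;

    HiddenBlocking(int poolSize) {
        this.connections = new Semaphore(poolSize);
    }

    // Looks async, but blocks the *calling* thread when the pool is exhausted.
    // Once a slow dependency stops releasing connections, every "async" call
    // pins a request thread here, and the server stops accepting work.
    CompletableFuture<String> callAsync(String request) {
        connections.acquireUninterruptibly(); // blocks if all connections are in flight
        CompletableFuture<String> future = new CompletableFuture<>();
        future.whenComplete((response, error) -> connections.release());
        return future;
    }

    int availableConnections() {
        return connections.availablePermits();
    }
}
```

A mechanism like this would also explain `active_requests` = 0 if the metric is only incremented after the handler gets past the blocked call path.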
I'm looking for a plausible explanation of what could have happened. I know this is a very open-ended question, but I'd really appreciate some pointers in the right direction. I'm really stressed out :(
