
Async stack traces in folly: Introduction

September 16, 2021 · By Lee Howes

This article was written by Lee Howes and Lewis Baker from Facebook.

Facebook’s infrastructure is highly distributed and relies heavily on asynchronous programming. Most of our services that are written in C++ are implemented using async programming frameworks like folly::Futures and folly::coro. Async programming is an important tool for scaling a process with a relatively small number of threads to handle a much larger number of concurrent requests. However, these async programming models have typically come with some downsides.

One of these downsides is that existing debugging and profiling tools that use stack traces to report what your program was doing generally produce poor-quality data for async code.

This is the first in a short series of blog posts about how we are leveraging C++ coroutines to improve this situation, and describing the technical details of our solution. This first post gives a high level background, and later posts in the series go into fairly deep technical detail about the implementation in folly and surrounding tooling to assist debugging and profiling of coroutine code.

Why async stack traces?

For some years, the library we have relied on for async programming at Facebook is folly’s Futures library, which allows code like the following:

Executor& e = ...;
Future<void> s = makeSemiFuture()
    .via(&e)
    .thenValue([](auto&&) {
        doWork();
    });
doComplexOperation();
Future<void> s2 = std::move(s)
    .thenValue([](auto&&) {
        doMoreWork();
    });


where the lambda containing doWork will run at some point on the thread pool represented by e.

Async code like the example above generally involves launching an operation and then attaching a callback or continuation to that operation that will be executed when that operation completes. doComplexOperation can run on the main thread concurrently with doWork running on some other thread owned by the Executor. In general, this avoids much of the overhead of thread-context switches that you get when using thread-per-task concurrency. However, this also means that the callback is often executed in a different context from the context that launched it, usually from an executor's event-loop.

Under normal circumstances, when code is launched on an executor, for example a thread pool, a stack trace will represent the path from the thread’s execution loop to the function being run. Conceptually, the threads in the executor do something like:

void Executor::run() {
  while (!cancelled) {
    // Pop the next queued task and run it on this worker thread.
    auto t = queue_.getTask();
    t();
  }
}


If we run this in a debugger and break inside doWork the stack trace will look like:

- doWork
- <some function type-erasure internals...>
- Executor::run
- <some executor / thread startup internals...>
- __thread_start

The trace only covers stack frames from doWork’s body down to the run method of the executor. The connection to the calling code, and to doMoreWork, is lost. This loss of context can make debugging or profiling such code very challenging.

It is hard to fix this loss of context in futures code like the example above. With a normal stack-trace, when one function calls another, the full call stack is preserved for the duration of the call, allowing any stack walks performed while the function is executing to be able to easily trace back through the calling code without having to do any more work. However, with futures, the code that launches an operation continues executing and may unwind its call-stack before the continuation attached to the future runs. The calling context is not preserved and we’d need to effectively take a snapshot of the call stack at the time when the operation is launched to be able to later reconstruct the full call stack, incurring a large runtime overhead. Coroutines offer us a nesting that makes this cleaner:

Executor& e = ...;
auto l = []() -> Task<void> {
    co_await doWork();
    co_await doMoreWork();
};
l().scheduleOn(&e).start();


The compiler transforms this into something like the futures code above, but structurally there is a big difference: the next continuation is in the same scope as its parent. This makes the notion of a “stack” of coroutines much more meaningful, at least syntactically. Taking advantage of this structure to help with debugging is still a technical challenge.

Coroutines as callbacks

C++ uses a style of coroutines that suspend by returning from a function: we do not suspend the entire stack; instead, the coroutine returns to its caller and its suspended state is stored separately. This is distinct from the “fiber” style of coroutine, where the entire stack is suspended. The implementation ends up looking like a series of callbacks, with a hidden linked list of chained coroutine frames built up through the sequence of function calls.

There are big advantages to the style of coroutines chosen for C++, which we will not go into here, but one downside is that while the nested structure is there in the code, it is not directly apparent in the stack frames visible to a debugger or profiling tool.
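To make this concrete, here is a minimal sketch of a Task-like awaitable written against the standard <coroutine> header. It is not folly's implementation (folly::coro::Task is considerably more involved, and the names Task, FinalAwaiter and Awaiter here are purely illustrative), but it shows the essential trick: each awaited coroutine remembers the handle of the coroutine that awaited it, forming exactly the hidden linked list described above.

#include <coroutine>

// Illustrative only: results, exceptions and frame lifetime management are
// all omitted.
struct Task {
  struct promise_type {
    // Handle of the coroutine that co_awaited us (the logical caller).
    std::coroutine_handle<> continuation;

    Task get_return_object() {
      return Task{std::coroutine_handle<promise_type>::from_promise(*this)};
    }
    std::suspend_always initial_suspend() noexcept { return {}; }

    // When this coroutine finishes, resume the awaiting coroutine instead of
    // returning to a caller on the stack (symmetric transfer).
    struct FinalAwaiter {
      bool await_ready() noexcept { return false; }
      std::coroutine_handle<> await_suspend(
          std::coroutine_handle<promise_type> h) noexcept {
        if (auto parent = h.promise().continuation) {
          return parent;
        }
        return std::noop_coroutine();  // nobody is awaiting us
      }
      void await_resume() noexcept {}
    };
    FinalAwaiter final_suspend() noexcept { return {}; }
    void return_void() {}
    void unhandled_exception() {}
  };

  // co_awaiting a Task records the awaiting frame as the child's continuation
  // and then resumes the child: this is one link in the hidden list.
  auto operator co_await() {
    struct Awaiter {
      std::coroutine_handle<promise_type> child;
      bool await_ready() noexcept { return false; }
      std::coroutine_handle<> await_suspend(
          std::coroutine_handle<> parent) noexcept {
        child.promise().continuation = parent;  // child -> parent link
        return child;
      }
      void await_resume() noexcept {}
    };
    return Awaiter{handle};
  }

  std::coroutine_handle<promise_type> handle;
};

The chain of continuation handles lives in heap-allocated coroutine frames rather than on any thread's stack, which is exactly why an ordinary stack walker cannot see it.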

To illustrate this problem, consider the following more complicated code snippet of folly::coro code, during the execution of which we want to sample a stack trace:

void normal_function_1() {
  // ... expensive code - sample taken here.
}
 
void normal_function_2() {
  // ...
  normal_function_1();
  // ...
}
 
folly::coro::Task<void> coro_function_1() {
  // ...
  normal_function_2();
  // ...
  co_return;
}
 
folly::coro::Task<void> coro_function_2() {
  // ...
  co_await coro_function_1();
  // ...
}
 
void run_application() {
  // ...
  folly::coro::blockingWait(
    coro_function_2().scheduleOn(folly::getGlobalCPUExecutor()));
  // ...
}
 
int main() {
  run_application();
} 


Currently, if the profiler captures a sample when code is executing within normal_function_1() then it might show a stack trace that looks something like:

- normal_function_1
- normal_function_2
- coro_function_1
- std::coroutine_handle::resume
- folly::coro::TaskWithExecutor::Awaiter::await_suspend::$lambda0::operator()
- folly::Function::operator()
- folly::CPUThreadPoolExecutor::run
- std::thread::_invoke
- __thread_start

Notice that only coro_function_1 appears; coro_function_2 is missing. From coro_function_1 the trace moves into internal details of the framework and executor. We can't see from this stack trace that coro_function_1() is called from coro_function_2(), and that coro_function_2() is in turn called from run_application() and main().

Further, if there were multiple call sites for coro_function_1(), a sampling profiler would probably merge the samples from all of those call sites together, making it difficult to determine which call sites are expensive and where to focus your efforts when looking for performance optimisations.

Ideally, both profiling tools and debuggers would be able to capture the logical stack trace instead. The result would show the relationship between coro_function_1 and its caller coro_function_2, and we'd end up with a stack trace for this sample that looks more like this:

- normal_function_1
- normal_function_2
- coro_function_1
- coro_function_2
- blockingWait
- run_application
- main

folly support for async stack traces

folly now implements a set of tools to support async stack traces for coroutines. The library provides fundamental hooks that are used by internal code profiling libraries. Those same hooks provide access to stack traces for debugging purposes.

These are briefly summarised here and we will go into detail in a later post.
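As a taste of what those hooks look like, the sketch below walks the chain of async stack frames for the current thread using the types declared in folly/tracing/AsyncStack.h. Treat the accessor names here as an illustration of the folly source rather than a stable, documented API; later posts in this series describe these structures in detail.

#include <folly/tracing/AsyncStack.h>

#include <cstdio>

// Print the return address of each async stack frame active on this thread.
void dumpCurrentAsyncStack() {
  auto* root = folly::tryGetCurrentAsyncStackRoot();
  if (root == nullptr) {
    std::printf("no active async operation on this thread\n");
    return;
  }
  for (auto* frame = root->getTopFrame(); frame != nullptr;
       frame = frame->getParentFrame()) {
    std::printf("  %p\n", frame->getReturnAddress());
  }
}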

Printing async stack traces when the program crashes

Probably the most frequent place where developers see stack traces is when programs crash. The folly library already provides a signal handler that prints the stack trace of the thread that is causing the program to crash. The signal handler now prints the async stack trace if there is a coroutine active on the current thread when the crash happens.
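For example, a program only needs folly's fatal-signal handler installed to get this behaviour. The sketch below installs it explicitly; in practice folly::init() normally does this for you during startup.

#include <folly/experimental/symbolizer/SignalHandler.h>

int main() {
  // Install folly's fatal signal handler. If the program later crashes while
  // a coroutine is active on the crashing thread, the handler prints the
  // async stack trace in addition to the normal one.
  folly::symbolizer::installFatalSignalHandler();

  // ... run the application ...
  return 0;
}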

Printing async stack traces on demand

During development, a frequent question developers have is: What is the series of function calls that led to this function being called? folly provides a convenience function to easily print the async stack trace on demand, helping developers quickly see how a function or coroutine is called:

#include <iostream>
#include <folly/experimental/symbolizer/Symbolizer.h> 

folly::coro::Task<void> co_foo() {
  std::cerr << folly::symbolizer::getAsyncStackTraceStr() << std::endl;
  co_return;
}


GDB extension to print async stack traces

C++ developers often need to use debuggers like GDB to debug crashes after-the-fact or to investigate buggy behavior in running programs. We recently implemented a GDB extension to easily print the async stack trace for the current thread from within the debugger:

# Print the async stack trace for the current thread
(gdb) co_bt
0x... in crash() ()
0x... in co_funcC() [clone .resume] ()
0x... in co_funcB() [clone .resume] ()
0x... in co_funcA() [clone .resume] ()
0x... in main ()
0x... in folly::detached_task() ()


Tracking where exceptions are thrown

We want to be able to tell where an exception was constructed or thrown when it is later caught. To meet this need, folly provides an exception tracer library that hooks into the Itanium C++ ABI for exception handling to track exception information. We have recently expanded the set of helper functions this library provides to make it easy to track where exceptions are thrown, capturing both normal and async stack traces. Below is an example program that uses these helpers:

#include <iostream>
// Header assumed here for getAsyncTrace (folly's smart exception tracer
// helpers); adjust to match your folly version.
#include <folly/experimental/exception_tracer/SmartExceptionTracer.h>

folly::coro::Task<void> co_funcA() {
  try {
    co_await co_funcB();
  } catch (const std::exception& ex) {
    // Prints where the exception was thrown
    std::cerr << "what(): " << ex.what() << ", " 
              << folly::exception_tracer::getAsyncTrace(ex)
              << std::endl;
  }
}


Concluding

This new functionality brings coroutines much closer to the level of debugging and tracing support we see with normal stacks. Facebook developers already benefit from these changes, and they are open source in the folly library for anybody to use.

In subsequent posts in this series, we will go into more detail about how this is implemented. In the next post, we will discuss the differences between synchronous and asynchronous stack traces and the technical challenges of implementing traces on top of C++20 coroutines. Stay tuned!

Other posts in this series

  • The second post will discuss the differences between synchronous and asynchronous stack traces and the technical challenges of implementing traces on top of C++20 coroutines.
  • The third post will show how we tie async stack frames together into a chain.
  • The fourth post, and the last to discuss the infrastructure of async stack traces, will cover walking async and synchronous stacks together and how all of this connects together.
  • The final post in the series is about integration work we have done to add the folly support discussed above: crash traces, on-demand printing, gdb integration and exception tracing.
