This article was written by Lee Howes and Kenny Yu from Facebook.
This is the final post in a series covering how we have used C++ coroutines at Facebook to regain stack traces for dependent chains of asynchronous waiting tasks. In previous posts in this series, we described how we implement async stack traces in folly on top of C++ coroutines. In this final post, we talk about the tooling we’ve built on top of this functionality to make debugging and profiling of coroutine code possible.
Most C++ developers at Facebook work on a team that operates a service. Operating a service in production requires developers to solve problems in a variety of situations.
Different tools for stack walking are suitable for each situation, and we have found it useful to provide async stack trace tooling for each of them: continuous profiling of services in production, crash reporting from signal handlers, on-demand trace printing during development, debugging with GDB, and tracking where exceptions are thrown.
Below, we discuss how we built these async stack trace tools and the challenges we discovered along the way. Much of this work is open source and available in the folly repo on GitHub.
The most common use for stack traces, and now async stack traces, across Facebook infrastructure is in profiling services.
To provide service owners, and engineers more broadly, with a good idea of what is running and where bottlenecks appear, we constantly take sampled stack traces in production. This works well for normal synchronous stack traces, but for the wide range of asynchronous code built on folly’s Futures library it provided only limited information.
As coroutine use has grown in production, we have added the ability to hook into async stack traces when profiling. The machinery we described in the previous posts is carefully designed so that it can be processed by BPF programs running during sampling. These BPF programs can walk asynchronous stacks from outside the process and log the sampled traces. This gives a clear picture not only of what code is running, but of what asynchronously depends on what across the fleet of machines.
Another time where developers commonly see stack traces is when programs crash. The folly library already provides a signal handler that prints the stack trace of the thread that is causing the program to crash, e.g., because of a SIGSEGV or SIGABRT signal. We extended this signal handler to also print the async stack trace if there is an ongoing coroutine call on the current thread.
As an example, suppose we have the following buggy program:
```cpp
#include <folly/executors/CPUThreadPoolExecutor.h>
#include <folly/experimental/coro/Task.h>
#include <folly/init/Init.h>

void crash() {
  char* ptr = nullptr;
  *ptr = 'c'; // Trigger a crash
}

folly::coro::Task<void> co_funcC() {
  crash();
  co_return;
}

folly::coro::Task<void> co_funcB() {
  co_await co_funcC();
}

folly::coro::Task<void> co_funcA() {
  co_await co_funcB();
}

int main(int argc, char** argv) {
  folly::init(&argc, &argv);
  folly::CPUThreadPoolExecutor executor(1);
  co_funcA().scheduleOn(&executor).start().get();
  return 0;
}
```
When the program crashes, the user now sees the async stack trace along with the normal stack trace, providing more context on the asynchronous operations that led to the crash. See the example output below (addresses, file names, and some functions omitted for brevity):
```
*** Aborted at ... ***
*** Signal 11 (SIGSEGV) ... stack trace: ***
    @ folly::symbolizer::(anonymous namespace)::innerSignalHandler(int, siginfo_t*, void*)
    @ ...
    @ crash()
    @ co_funcC() [clone .resume]
    @ ...
    @ folly::CPUThreadPoolExecutor::threadRun(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)
    @ ...
    @ execute_native_thread_routine
    @ start_thread

*** Check failure async stack trace: ***
    ...
    @ folly::symbolizer::(anonymous namespace)::innerSignalHandler(int, siginfo_t*, void*)
    @ ...
    @ crash()
    @ co_funcC() [clone .resume]
    @ co_funcB() [clone .resume]
    @ co_funcA() [clone .resume]
    @ main
    @ folly::detached_task()
```
Handling signals is tricky: signal handlers have many restrictions in order to be async-signal-safe. For example, a signal handler cannot acquire locks, cannot allocate memory, and must call only reentrant functions. To print the normal and async stack traces within a signal handler, we preallocate memory during program initialization to hold a fixed number of stack frames. During signal handling, the handler avoids dynamic allocation by writing the normal stack trace into this preallocated buffer, and then reuses the same buffer when walking the async stack trace.
During development, a frequent question developers have is: What is the series of function calls that led to this function being called? Folly provides a convenience function to easily print the async stack trace on demand, helping developers quickly see how a function or coroutine is called, and what is dependent on it. Below is an example of how to use the convenience function:
```cpp
#include <iostream>

#include <folly/experimental/coro/Task.h>
#include <folly/experimental/symbolizer/Symbolizer.h>

folly::coro::Task<void> co_foo() {
  std::cerr << folly::symbolizer::getAsyncStackTraceStr() << std::endl;
  co_return;
}
```
Note that symbolizing stack traces is potentially expensive, so this convenience function should only be used for debugging.
C++ developers often need to use debuggers like GDB to debug crashes after-the-fact or to investigate buggy behavior in running programs. In addition to printing the async stack trace when the program crashes, we recently implemented a GDB extension to easily print the async stack trace for the current thread from within the debugger.
Below is an example output from a GDB session where we are investigating a core dump generated from a crash of the buggy program from the previous section (addresses and file names omitted for brevity).
```
# Load the folly::coro gdb extension
(gdb) source -v folly/experimental/coro/scripts/gdb.py
Pretty printers for folly::coro. Available commands:

co_bt [async_stack_root_addr]  Prints async stack trace for the current thread.
                               If an async stack root address is provided,
                               prints the async stack starting from this root.
co_async_stack_roots           Prints all async stack roots.

# Print the async stack trace for the current thread
(gdb) co_bt
0x... in crash() ()
0x... in co_funcC() [clone .resume] ()
0x... in co_funcB() [clone .resume] ()
0x... in co_funcA() [clone .resume] ()
0x... in main ()
0x... in folly::detached_task() ()

# Print the async stack trace for all threads
(gdb) thread apply all co_bt
```
The source command loads the GDB extension, which provides new commands available at the GDB command line. The co_bt command prints the async stack trace for the current thread if there is a coroutine call in progress. The co_bt command can also be composed with other GDB commands, such as thread apply all co_bt, which will print the async stack trace for all threads. The GDB extension also provides a command to print async stack roots for debugging the async stack framework itself.
We used the GDB Python API to build the async stack trace GDB extension. To be most useful, we require the extension to work both with running programs and with core dumps after programs have crashed. As a result, the extension cannot make use of C++ function calls to obtain async stack trace information, as function calls will not work with core dumps. This means that the extension needs to examine pthread and async stack frame internals to extract information.
To collect the async stack trace information within GDB, the first step is to obtain the async stack root pointer for the current thread. In the glibc Native POSIX Threads Library (NPTL), pthread_t is an alias for struct pthread*, where struct pthread is the internal implementation type for a thread. NPTL represents thread-local storage using a 2-level lookup table rooted at the specific field within struct pthread. For example, given a pthread_key_t, we can find the corresponding pthread_key_data for the key at the following location:
```cpp
// The key (an integer) for thread-local storage
pthread_key_t key = ...;

// The thread we want to look up. pthread_t is
// an alias for struct pthread*
pthread_t thread = ...;

// Find the data associated with `key` within the
// 2-level lookup table
void* data = ((struct pthread*)thread)
    ->specific[(int)key / PTHREAD_KEY_2NDLEVEL_SIZE]
              [(int)key % PTHREAD_KEY_2NDLEVEL_SIZE]
    .data;
```
Note that most programs use very few pthread_key_data entries. As an optimization, NPTL inlines the first block of entries directly into struct pthread as the specific_1stblock field, avoiding dynamic allocation of pthread_key_data entries for most programs. This means that, given a struct pthread* pd, pd->specific[0] == pd->specific_1stblock.
Putting it all together, this is how the GDB extension collects the async stack trace for the current thread:

1. Obtain the pthread_t for the current thread and treat it as a struct pthread*.
2. Use folly's async stack root thread-local key to index into the thread's 2-level specific table and find the async stack root for that thread.
3. From the root, follow the chain of async stack frames, recording each frame's instruction pointer.
4. Symbolize the recorded addresses and print the resulting trace.
When C++ developers monitor their services in production, one important question they often investigate is: What exceptions are being thrown, and where are they being thrown from? Currently, the C++ standard library does not provide facilities to easily track throw sites for exceptions. To meet this need, folly provides an exception tracer library that hooks into the Itanium C++ ABI's exception handling to track exception information. Folly uses this library to provide helpers that make it easy to track where exceptions are thrown, with both normal and async stack traces. Below is an example program that uses these helpers:
```cpp
#include <iostream>

#include <folly/executors/CPUThreadPoolExecutor.h>
#include <folly/experimental/coro/Task.h>
#include <folly/experimental/exception_tracer/SmartExceptionTracer.h>
#include <folly/init/Init.h>

[[noreturn]] void bad() {
  throw std::runtime_error("test");
}

folly::coro::Task<void> co_funcC() {
  bad();
  co_return;
}

folly::coro::Task<void> co_funcB() {
  co_await co_funcC();
}

folly::coro::Task<void> co_funcA() {
  try {
    co_await co_funcB();
  } catch (const std::exception& ex) {
    // Prints where the exception was thrown
    std::cerr << "what(): " << ex.what() << ", "
              << folly::exception_tracer::getAsyncTrace(ex) << std::endl;
  }
}

int main(int argc, char** argv) {
  folly::init(&argc, &argv);
  folly::CPUThreadPoolExecutor executor(1);
  co_funcA().scheduleOn(&executor).start().get();
  return 0;
}
```
Below is example output from the above program (addresses omitted for brevity):
```
what(): test, Exception type: std::runtime_error
    @ folly::exception_tracer::(anonymous namespace)::throwCallback(void*, std::type_info*, void (**)(void*))
    @ __cxa_throw
    @ bad()
    @ co_funcC() [clone .resume]
    @ co_funcB() [clone .resume]
    @ co_funcA() [clone .resume]
    @ main
    @ folly::detached_task()
```
In this output, we can see that the exception was thrown from bad, and the output includes the async stack trace to see how bad was called.
To understand how this works, we first need to understand the Itanium C++ ABI, which specifies how exceptions are processed. For example, whenever an exception is thrown in C++, the compiler generates a call to the following function:
```cpp
// Called when an exception is thrown
// - `thrown_exception` is a pointer to the thrown object
// - `tinfo` contains type information about the thrown object
//   to match against potential catch sites
// - `dest` is a function pointer to eventually destroy the
//   thrown object
void __cxa_throw(
    void* thrown_exception,
    std::type_info* tinfo,
    void (*dest)(void*));
```
In addition, the Itanium C++ ABI specifies how exceptions are caught and rethrown, and what happens when exceptions are not caught. The folly exception tracer library allows users to register custom callbacks to be invoked as part of exception processing, for example whenever an exception is thrown. To implement this, the library provides its own implementation of __cxa_throw that first invokes the user-supplied callbacks and then uses dlsym to call the real implementation of __cxa_throw for normal C++ exception handling.
To track the normal and async stack traces at the point where exceptions are thrown, SmartExceptionTracer supplies a callback that captures both the normal stack trace and the async stack trace at the throw site, and associates them with the thrown exception object so that they can be retrieved later, for example by getAsyncTrace when the exception is caught.
This approach adds a slight runtime overhead at the point where exceptions are thrown. We believe this is acceptable: the cost is paid only when an exception is actually thrown, exceptions should be rare in well-behaved programs, and the extra context is well worth it when debugging.
That is the end of this series of posts. It’s a lot of information, but we hope it both helps document what folly is doing and provides inspiration about interesting ways we can make use of C++ coroutines to provide facilities that were challenging to provide otherwise.
There is future work to do here: some of the costs could be mitigated by changes to the compiler, and the lessons we've learned from this implementation will influence that work. However, this work is not experimental. Coroutines are being used heavily in production across Facebook infrastructure, and they were rapidly adopted by our developers for their ease of use and increased readability even before we improved debugging and tracing. We are actively using this tracing infrastructure to learn about and debug async workloads in production.
To learn more about Facebook Open Source, visit our open source site, subscribe to our YouTube channel, or follow us on Twitter and Facebook.
Interested in working with open source technologies at Facebook? Check out our open source-related job postings on our career page.