This article was written by Lee Howes and Kenny Yu from Facebook.
This is the final post in a series covering how we have used C++ coroutines at Facebook to regain stack traces for dependent chains of asynchronous waiting tasks. In previous posts in this series, we described how we implement async stack traces in folly on top of C++ coroutines. In this final post, we talk about the tooling we’ve built on top of this functionality to make debugging and profiling of coroutine code possible.
Most C++ developers at Facebook work on a team that operates a service. Operating a service in production requires developers to solve problems in a variety of situations.
Different tools for stack walking are suitable for each situation, and we have found it useful to provide async stack trace tooling for each of them: continuous profiling of services in production, crash reporting from signal handlers, on-demand trace printing during development, debugging with GDB, and tracking where exceptions are thrown.
Below, we discuss how we built these async stack trace tools and the challenges we discovered along the way. Much of this work is open source and available in the folly repo on GitHub.
The most common use for stack traces, and now async stack traces, across Facebook infrastructure is in profiling services.
To provide service owners, and engineers more broadly, with a good idea of what is running and where bottlenecks appear, we constantly take sampled stack traces in production. This works well for normal synchronous stack traces, but for the wide range of asynchronous code built on folly’s Futures library it provided only limited information.
As coroutine use has grown in production, we have added the ability to hook into async stack traces when profiling. The machinery we described in the previous posts is carefully designed so that it can be processed by BPF programs running during sampling. These BPF programs can walk asynchronous stacks from outside the process and log the sampled traces. This gives a clear picture not only of what code is running, but of what asynchronously depends on what across the fleet of machines.
Another time where developers commonly see stack traces is when programs crash. The folly library already provides a signal handler that prints the stack trace of the thread that is causing the program to crash, e.g., because of a SIGSEGV or SIGABRT signal. We extended this signal handler to also print the async stack trace if there is an ongoing coroutine call on the current thread.
As an example, suppose we have the following buggy program:
```cpp
#include <folly/executors/CPUThreadPoolExecutor.h>
#include <folly/experimental/coro/Task.h>
#include <folly/init/Init.h>

void crash() {
  char* ptr = nullptr;
  *ptr = 'c'; // Trigger a crash
}

folly::coro::Task<void> co_funcC() {
  crash();
  co_return;
}

folly::coro::Task<void> co_funcB() {
  co_await co_funcC();
}

folly::coro::Task<void> co_funcA() {
  co_await co_funcB();
}

int main(int argc, char** argv) {
  folly::init(&argc, &argv);
  folly::CPUThreadPoolExecutor executor(1);
  co_funcA().scheduleOn(&executor).start().get();
  return 0;
}
```
When the program crashes, the user now sees the async stack trace along with the normal stack trace, providing more context on the asynchronous operations that led to the crash. See the example output below (addresses, file names, and some functions omitted for brevity):
```
*** Aborted at ... ***
*** Signal 11 (SIGSEGV) ... stack trace: ***
    @ folly::symbolizer::(anonymous namespace)::innerSignalHandler(int, siginfo_t*, void*)
    @ ...
    @ crash()
    @ co_funcC() [clone .resume]
    @ ...
    @ folly::CPUThreadPoolExecutor::threadRun(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)
    @ ...
    @ execute_native_thread_routine
    @ start_thread

*** Check failure async stack trace: ***
    ...
    @ folly::symbolizer::(anonymous namespace)::innerSignalHandler(int, siginfo_t*, void*)
    @ ...
    @ crash()
    @ co_funcC() [clone .resume]
    @ co_funcB() [clone .resume]
    @ co_funcA() [clone .resume]
    @ main
    @ folly::detached_task()
```
Handling signals is tricky: signal handlers have many restrictions in order to be async-signal-safe. For example, a signal handler cannot acquire locks, cannot allocate memory, and must call only reentrant functions. To print the normal and async stack traces within a signal handler, we preallocate memory during program initialization to hold a fixed number of stack frames. During signal handling, the handler avoids dynamic allocation by writing the normal stack trace into this preallocated buffer, and then reuses the same buffer when walking the async stack trace.
During development, a frequent question developers have is: What is the series of function calls that led to this function being called? Folly provides a convenience function to easily print the async stack trace on demand, helping developers quickly see how a function or coroutine is called, and what is dependent on it. Below is an example of how to use the convenience function:
```cpp
#include <iostream>

#include <folly/experimental/coro/Task.h>
#include <folly/experimental/symbolizer/Symbolizer.h>

folly::coro::Task<void> co_foo() {
  std::cerr << folly::symbolizer::getAsyncStackTraceStr() << std::endl;
  co_return;
}
```
Note that symbolizing stack traces is potentially expensive, so this convenience function should only be used for debugging.
C++ developers often need to use debuggers like GDB to debug crashes after-the-fact or to investigate buggy behavior in running programs. In addition to printing the async stack trace when the program crashes, we recently implemented a GDB extension to easily print the async stack trace for the current thread from within the debugger.
Below is an example output from a GDB session where we are investigating a core dump generated from a crash of the buggy program from the previous section (addresses and file names omitted for brevity).
```
# Load the folly::coro gdb extension
(gdb) source -v folly/experimental/coro/scripts/gdb.py
Pretty printers for folly::coro. Available commands:

co_bt [async_stack_root_addr]  Prints async stack trace for the current thread.
                               If an async stack root address is provided,
                               prints the async stack starting from this root.
co_async_stack_roots           Prints all async stack roots.

# Print the async stack trace for the current thread
(gdb) co_bt
0x... in crash() ()
0x... in co_funcC() [clone .resume] ()
0x... in co_funcB() [clone .resume] ()
0x... in co_funcA() [clone .resume] ()
0x... in main ()
0x... in folly::detached_task() ()

# Print the async stack trace for all threads
(gdb) thread apply all co_bt
```
The source command loads the GDB extension, which provides new commands available at the GDB command line. The co_bt command prints the async stack trace for the current thread if there is a coroutine call in progress. The co_bt command can also be composed with other GDB commands, such as thread apply all co_bt, which will print the async stack trace for all threads. The GDB extension also provides a command to print async stack roots for debugging the async stack framework itself.
We used the GDB Python API to build the async stack trace GDB extension. To be most useful, we require the extension to work both with running programs and with core dumps after programs have crashed. As a result, the extension cannot make use of C++ function calls to obtain async stack trace information, as function calls will not work with core dumps. This means that the extension needs to examine pthread and async stack frame internals to extract information.
To collect the async stack trace information within GDB, the first step is to obtain the async stack root pointer for the current thread. In the glibc Native POSIX Threads Library (NPTL), pthread_t is an alias for struct pthread*, where struct pthread is the internal implementation type for a thread. NPTL represents thread-local storage using a 2-level lookup table rooted at the specific field within struct pthread. For example, given a pthread_key_t, we can find the corresponding pthread_key_data for the key at the following location:
```cpp
// The key (an integer) for thread-local storage
pthread_key_t key = ...;

// The thread we want to look up. pthread_t is
// an alias for struct pthread*
pthread_t thread = ...;

// Find the data associated with `key` within the
// 2-level lookup table
void* data = ((struct pthread*)thread)
    ->specific[(int)key / PTHREAD_KEY_2NDLEVEL_SIZE]
              [(int)key % PTHREAD_KEY_2NDLEVEL_SIZE]
    .data;
```
Note that most programs use very few pthread_key_data entries. As an optimization, NPTL inlines the first block of entries directly into struct pthread as the specific_1stblock field, avoiding dynamic allocation of pthread_key_data entries for most programs. This means that, given a struct pthread* pd, pd->specific[0] == pd->specific_1stblock.
Putting it all together, this is how the GDB extension collects the async stack trace for the current thread:

1. Obtain the pthread_t for the current thread and treat it as a struct pthread*.
2. Use folly's async stack root thread-local key to index into the thread's 2-level specific table and find the async stack root for that thread.
3. From the root, follow the chain of async stack frames, recording each frame's instruction pointer.
4. Symbolize the recorded addresses and print the resulting trace.
When C++ developers monitor their services in production, one important question they often investigate is: What exceptions are being thrown, and where are they being thrown from? Currently, the C++ standard library does not provide facilities to easily track throw sites for exceptions. To meet this need, folly provides an exception tracer library that hooks into the Itanium C++ ABI's exception handling to track exception information. Folly uses this library to provide helpers that make it easy to track where exceptions are thrown, with both normal and async stack traces. Below is an example program that uses these helpers:
```cpp
#include <iostream>

#include <folly/executors/CPUThreadPoolExecutor.h>
#include <folly/experimental/coro/Task.h>
#include <folly/experimental/exception_tracer/SmartExceptionTracer.h>
#include <folly/init/Init.h>

[[noreturn]] void bad() {
  throw std::runtime_error("test");
}

folly::coro::Task<void> co_funcC() {
  bad();
  co_return;
}

folly::coro::Task<void> co_funcB() {
  co_await co_funcC();
}

folly::coro::Task<void> co_funcA() {
  try {
    co_await co_funcB();
  } catch (const std::exception& ex) {
    // Prints where the exception was thrown
    std::cerr << "what(): " << ex.what() << ", "
              << folly::exception_tracer::getAsyncTrace(ex) << std::endl;
  }
}

int main(int argc, char** argv) {
  folly::init(&argc, &argv);
  folly::CPUThreadPoolExecutor executor(1);
  co_funcA().scheduleOn(&executor).start().get();
  return 0;
}
```
Below is example output from the above program (addresses omitted for brevity):
```
what(): test, Exception type: std::runtime_error
    @ folly::exception_tracer::(anonymous namespace)::throwCallback(void*, std::type_info*, void (**)(void*))
    @ __cxa_throw
    @ bad()
    @ co_funcC() [clone .resume]
    @ co_funcB() [clone .resume]
    @ co_funcA() [clone .resume]
    @ main
    @ folly::detached_task()
```
In this output, we can see that the exception was thrown from bad, and the output includes the async stack trace to see how bad was called.
To understand how this works, we first need to understand the Itanium C++ ABI, which specifies how exceptions are processed. For example, whenever an exception is thrown in C++, the compiler generates a call to the following function:
```cpp
// Called when an exception is thrown
// - `thrown_exception` is a pointer to the thrown object
// - `tinfo` contains type information about the thrown object
//   to match against potential catch sites
// - `dest` is a function pointer to eventually destroy the
//   thrown object
void __cxa_throw(
    void* thrown_exception,
    std::type_info* tinfo,
    void (*dest)(void*));
```
In addition, the Itanium C++ ABI specifies how exceptions are caught and rethrown, and what happens when exceptions are not caught. The folly exception tracer library allows users to register custom callbacks to be invoked as part of exception processing, for example whenever an exception is thrown. To implement this, the library provides its own implementation of __cxa_throw that first invokes the user-supplied callbacks and then uses dlsym to call the real implementation of __cxa_throw for normal C++ exception handling.
To track the normal and async stack traces at the point where exceptions are thrown, SmartExceptionTracer supplies a callback that captures both the normal stack trace and the async stack trace at the throw site, and associates them with the thrown exception object so that they can be retrieved later, for example by getAsyncTrace when the exception is caught.
This approach adds a slight runtime overhead at the point where exceptions are thrown. We believe this is acceptable: the cost is paid only when an exception is actually thrown, exceptions should be rare in well-behaved programs, and the extra context is well worth it when debugging.
That is the end of this series of posts. It’s a lot of information, but we hope it both helps document what folly is doing and provides inspiration about interesting ways we can make use of C++ coroutines to provide facilities that were challenging to provide otherwise.
There is future work to do here: some of the costs could be mitigated by changes to the compiler, and the lessons we've learned from this implementation will influence that work. However, this work is not experimental. Coroutines are being used heavily in production across Facebook infrastructure, and they were rapidly adopted by our developers for their ease of use and increased readability even before we improved debugging and tracing. We are actively using this tracing infrastructure to learn about and debug async workloads in production.
To learn more about Facebook Open Source, visit our open source site, subscribe to our YouTube channel, or follow us on Twitter and Facebook.
Interested in working with open source technologies at Facebook? Check out our open source-related job postings on our career page.