Async Stack Traces for C++ Coroutines in Folly: Walking the async stack

October 14, 2021ByLee Howes

This article was written by Lee Howes and Lewis Baker from Facebook.

This is the fourth in a series of posts covering how we have used C++ coroutines at Facebook to regain stack traces for dependent chains of asynchronous waiting tasks. In the previous post we built on the basic async stack trace data structures to automatically build the linked list of stack frames at runtime. In this final post about the infrastructure of async stack traces, we’ll describe how to find and walk those stacks.

Finding the top async stack-frame

When we start capturing a normal stack-trace, we first need to obtain the pointer to the topmost stack-frame. This is easily achieved by debuggers and profilers as this pointer is usually held in a dedicated register (rbp in the case of x86_64 architectures). For async stack-traces there is no such dedicated place reserved for this, so we need to create our own.

Note that for a given thread there may actually be many active async stack-frames. For example, if an executor's event loop resumes a coroutine which then blocks on another coroutine, by calling blockingWait() , this will enter a new executor's event loop which may then run another coroutine. For example:

void compute_something();
 
folly::coro::Task<void> coro1() {
  compute_something();
  co_return;
}
 
void func1() {
  folly::coro::blockingWait(coro1());
}
 
folly::coro::Task>void> coro2() {
  func1();
  co_return;
}

So when an event loop resumes a new coroutine it needs to save the existing topmost async stack frame and restore the previous frame as the topmost async stack frame when that coroutine suspends.

The obvious strategy to use here is to use a thread_local variable to store a pointer to the top-most AsyncStackFrame. However, a naive use of thread_local has a couple of problems.

The first problem is that there is no standard way to obtain the address of a thread_local for a given thread from outside of the program. The linker can choose one of several strategies to implement thread_local variable access depending on linker settings and determining which one was chosen usually requires inspecting the assembly. This is another example of the kind of complication during stack walking that we wish to avoid.

The second problem is a potential performance cost. The thread_local variable will need to be updated every time a coroutine is resumed so that refers to the AsyncStackFrame of the currently-active coroutine. Accessing a thread_local has traditionally been a comparatively expensive operation and having to do this every call and return of a coroutine could be a significant overhead ^[1].

The strategy we chose to avoid this overhead was to cache the thread_local value in the AsyncStackFrame data structure so that we can pass the value directly from caller to callee and from callee back to caller without needing to access the thread_local. We only need to access and modify the thread_local value when work is resumed on an executor and not on every coroutine call/return. This comes at the cost of an extra pointer of storage in the coroutine frame and some extra loads and stores to copy the pointer.

The extra member, that we’ll call stackRoot, serves an additional purpose: it allows us to link from the end of a chain of async-stack frames back to a normal stack frame that is currently blocked waiting for the Task to complete. We can modify the AsyncStackFrame to add this member:

 namespace folly
{
  struct AsyncStackRoot;
  
  struct AsyncStackFrame {
    AsyncStackFrame* parentFrame;
    void* returnAddress;
    AsyncStackRoot* stackRoot;
  };
  
  struct AsyncStackRoot {
    AsyncStackFrame* topFrame;
    // ... other members
  };
}

Then modify the Task's await_suspend() method to propagate the value on co_await and co_return:

template<typename T>
__attribute__((noinline))
auto Task<T>::Awaiter::await_suspend_impl(std::coroutine_handle<> continuation,
                                          AsyncStackFrame& callerFrame) {
  auto& calleeFrame = coro_.promise().getAsyncFrame();
  calleeFrame.parentFrame = &callerFrame;
  calleeFrame.returnAddress = __builtin_return_address(0);
 
  // Now copy the cached stackRoot from caller to callee and
  // update the stackRoot to point to the callee frame.
  auto* stackRoot = callerFrame.stackRoot;
  calleeFrame.stackRoot = stackRoot;
  stackRoot->topFrame = &calleeFrame;
  callerFrame.stackRoot = nullptr;
 
  // ... other await_suspend() logic
 
  return coro_;
}
 
template<typename T>
auto TaskPromise<T>::FinalAwaiter::await_suspend(
    std::coroutine_handle<Promise> h) noexcept {
  auto& promise = h.promise();
 
  // Restore the parent AsyncStackFrame as the active async frame,
  // copying the cached stackRoot value back to the caller's frame
  // (it may be a different AsyncStackRoot from the one it had last).
  AsyncStackFrame& calleeFrame = promise.getAsyncFrame();
  AsyncStackRoot* stackRoot = calleeFrame.stackRoot;
  AsyncStackFrame* callerFrame = calleeFrame.parentFrame;
  callerFrame->stackRoot = stackRoot;
  stackRoot->topFrame = callerFrame;
 
  // Symmetric-transfer to continuation
  return promise.continuation;
}

This is a fair amount of pointer manipulation for each call/return of a coroutine, which does have some overhead but is reasonably cache friendly and so the overhead is not too great in practice. There is room for improvement and we plan to explore further the different strategies for maintaining the thread's currently active AsyncStackFrame.

The other challenge mentioned above is determining the location of the thread_local variable for a particular thread from outside of the process, using only information found in the binary and the process's memory.

Instead of trying to calculate the address of the thread_local directly, which can be difficult, we make use of pthread APIs to store some extra data for the thread.

In a Linux thread implementation that uses pthreads, each thread has a control-block whose layout is defined internal to the pthread library. This structure contains values like the thread-id, process-id, thread priorities and various other internal details of thread handling. It also contains storage for thread-local values set using the pthread_setspecific() API.

The OS usually stores a pointer to this structure in a register, for example the 'fs' register on Intel platforms, which can be obtained by profilers by inspecting the corresponding data-member of the saved CPU context. If we store the address of the thread-local value in a slot in this pthread data-structure using pthread_setspecific() we can use our knowledge of the layout of this data structure to read the thread_local's address out of the appropriate TLS slot.

Then all we need to do is export a global variable that holds the index of the TLS slot that is being used and setup some code to run on thread-creation to populate the slot with the address of that thread's thread_local.

The first part involves initialising the global variable to contain the pthread_key_t for the TLS slot it will be writing to.

// Export this as 'extern "C"' so that its symbol doesn't use
// C++ name mangling and is easier to lookup in the ELF.
extern "C" pthread_key_t folly_async_stack_root_tls_key = -1;
 
namespace {
  static pthread_once_t initialiseTlsKeyFlag = PTHREAD_ONCE_INIT;
 
  static void ensureAsyncRootTlsKeyIsInitialised() noexcept {
    (void)pthread_once(&initialiseTlsKeyFlag, []() noexcept {
      (void)pthread_key_create(&folly_async_stack_root_tls_key, nullptr);
    });
  }

We define a helper struct that holds the pointer and that on construction stores a pointer to itself in the pthread TLS slot.

// A helper struct whose constructor stores a pointer to itself in the
  // pthread TLS slot indicated by folly_async_stack_root_tls_key.
  struct AsyncStackRootHolder {
    AsyncStackRootHolder() noexcept {
      ensureAsyncRootTlsKeyIsInitialised();
      (void)pthread_setspecific(folly_async_stack_root_tls_key, this);
    }
 
    AsyncStackRoot* get() const noexcept {
      return value.load(std::memory_order_relaxed);
    }
    void set(AsyncStackRoot* root) noexcept {
      value.store(root, std::memory_order_release);
    }
    void set_relaxed(AsyncStackRoot* root) noexcept {
      value.store(root, std::memory_order_relaxed);
    }
 
    // Use an atomic variable to prevent compiler reordering of write to 'value'
    // to be visible before writes to the AsyncStackRoot object itself.
    std::atomic<AsyncStackRoot*> value{nullptr};
  };
 
  // Finally, declare the thread_local value that we will be writing to.
  static thread_local AsyncStackRootHolder currentThreadAsyncStackRoot; 
} // namespace

Whenever we register a new AsyncStackRoot, that is whenever we are about to call an async callback/continuation, we store the AsyncStackRoot on the stack and store its address in currentThreadAsyncStackRoot.value. A profiling tool can read the folly_async_stack_root_tls_key global variable and use that to lookup the pthread TLS slot which will contain the address of the pointer to the AsyncStackRoot which in turn contains a pointer to the currently-active AsyncStackFrame that represents the start of the stack.

Finding the stack-frame that corresponds to an async-frame activation

Now we have enough pieces of the puzzle in place to be able to find the current thread's active coroutine and walk its chain of async callers. However, this only gives us part of the stack trace: the async frames. It doesn't include the normal stack frames of functions that were called from the coroutine.

For example, if we had the following code:

void func_a() {
  // stack-trace sampled here
}
 
void func_b() {
  func_a();
}
 
folly::coro::Task<void7> coro_c() {
  func_b();
  co_return;
}
 
folly::coro::Task<void> coro_d() {
  co_await coro_c();
}
 
folly::coro::Task<void> coro_e() {
  co_await coro_d();
}

And a profiler captured a sample while the program was executing func_a() then walking the current thread's async stack frames would capture the following instruction-pointers:

- coro_d
- coro_e
- ... (whatever called coro_e)

Whereas what we'd like to see is:

- func_a
- func_b
- coro_c
- coro_d
- coro_e
-- ... (whatever called coro_e)

As we already know how to capture normal stack trace, all we really need to be able to do is record which of those normal stack-frames corresponds to the stack frame that resumed the coroutine (or called the callback) then cut out that frame and all subsequent frames and then splice in the async stack into the call-stack.

To achieve this, we need to record the return address of the frame that is resuming the coroutine alongside the pointer to the coroutine's AsyncStackFrame. i.e. store it in the AsyncStackRoot structure. The AsyncStackRoot structure now looks like this:

struct AsyncStackRoot {
  AsyncStackFrame* topFrame;
  void* returnAddress;
  // ... other fields
};

We can write the following helper that we can use to resume a coroutine with an AsyncStackFrame, establishing a new AsyncStackRoot for the call:

template<typename Promise>
__attribute__((noinline))
void resumeWithNewAsyncStackRoot(std::coroutine_handle<Promise> coro) noexcept{
  AsyncStackRoot thisStackRoot;
  thisStackRoot.topFrame = &coro.promise().getAsyncFrame();
  thisStackRoot.returnAddress = __builtin_return_address(0);
 
  // Save previous stack-root and install our new one
  AsyncStackRoot* prevStackRoot = currentThreadAsyncStackRoot.get();
  currentThreadAsyncStackRoot.set(&thisStackRoot);
  
  // Resume the coroutine
  coro.resume();
  
  // Restore the previous stack-root
  currentThreadAsyncStackRoot.setRelaxed(prevStackRoot);
}

When we start walking the async-stack frames we can also capture the returnAddress from the AsyncStackRoot and use this to splice together the async frames with the normal stack frames in a post-processing step.

To better understand how all of these pieces are fitting together, let's look at what the data-structures look like for the above example:

There's a lot to unpack there! Hopefully it makes sense with the prior explanations, but we'll walk through the process of how to capture the async stack trace.

When a profiler tool or debugger inspects the state of a thread for which it wants an async stack trace it will first capture a normal stack-trace: the first element populated from the rip (instruction-pointer) register. The top of the stack is referenced by the rsp register and the currently executing code by rip. The remaining elements are found by reading the returnAddress fields from the chain of stack-frames, the first of which is pointed to by the rbp (base pointer) register.

To walk the async-stack part we start by obtaining the pointer to the thread control block (stored in the fs register) and load from the folly_async_stack_root_tls_key global variable to compute the offset. We determine the address of the folly_async_stack_root_tls_key variable by inspecting the ELF symbol table of the binary.

Once we have these two pieces of information, we calculate the address of the pthread TLS slot that contains the address of the thread_local AsyncStackRootHolder.

From there we can read the AsyncStackRoot pointer. If this pointer is non-null, we read the pointer to the first AsyncStackFrame and the return address of the stack-frame that registered the AsyncStackRoot.

We truncate the normal stack-trace at the frame before this return address and start walking the chain of AsyncStackFrame objects until we get to a frame with a null parentFrame field, recording the returnAddress of each frame along the way.

Easy, right!

Switching between normal and async stack traces

There's one more thing we need to support: switching back and forth between walking normal and async stack traces.

Consider the following code example:

void some_func() {
  // sample taken here
}
 
folly::coro::Task<void> some_coro() {
  co_await ...;
  some_func();
  co_return;
}
 
void run() {
  folly::coro::blockingWait(
      some_coro().scheduleOn(folly::getGlobalCPUExecutor()));
}
 
int main(int argc, char** argv) {
  folly::init(&argc, &argv);
  run();
}

In this case, if we capture an async stack-trace sample when the thread is executing .some_func() then with the approach described above will capture the following stack-trace:

- some_func
- some_coro
- folly::coro::makeBlockingWaitTask (an internal coroutine created by blockingWait)

What we'd really like to see in this case is to continue walking the stack of the caller of blockingWait() so that we get a better picture of what was responsible for originally launching the chain of coroutine tasks. i.e.

- some_func
- some_coro
- folly::coro::makeBlockingWaitTask
- folly::coro::blockingWait
- run
- main
- ...

Further, if blockingWait() was called, either directly or indirectly from a coroutine, then we'd like to have the stack walk transition back to walking the async stack-frames once we reach the normal stack frame corresponding to the coroutine. Theoretically, the stack-walk could transition between walking normal frames and async frames an arbitrary number of times.

To support these use cases we need two new fields added to the AsyncStackRoot structure that allow us to continue walking the normal stack once we reach the end of the chain of AsyncStackFrame:

A pointer to the StackFrame of the function that registered the AsyncStackRoot
A pointer to the parent AsyncStackRoot (if any) so that we know where to stop walking normal stack frames and resume walking the next chain of AsyncStackFrame

So now we have:

namespace folly {
  struct AsyncStackRoot {
    AsyncStackFrame* topFrame;
    void* returnAddress;
    StackFrame* stackFrame;
    AsyncStackRoot* nextStackRoot;
  };
}

When a blocking function like blockingWait() launches an async operation, we store an extra AsyncStackRoot on the stack and setup a dummy AsyncStackFrame that links to it and use this AsyncStackFrame as the parent frame for the coroutine being launched.

template<typename Awaitable>
auto blockingWait(Awaitable&& awaitable) {
  AsyncStackRoot stackRoot;
  AsyncStackFrame dummyFrame;
 
  stackRoot.returnAddress = __builtin_return_address(0);
  stackRoot.stackFrame = (StackFrame*)__builtin_frame_address(0);
  stackRoot.nextStackRoot = &currentThreadAsyncStackRoot;
  stackRoot.topFrame = &dummyFrame;
  // NOTE: We don't update currentThreadAsyncStackRoot.
 
  dummyFrame.returnAddress = __builtin_return_address(0);
  dummyFrame.parentFrame = nullptr;
  dummyFrame.stackRoot = &stackRoot;
  
  // ... then launch new coroutine with 'dummyFrame' as its parent AsyncStackFrame
}

This allows the stack walker to walk from the last AsyncStackFrame (i.e. the dummyFrame) to the AsyncStackRoot and then to the StackFrame and start walking the normal stack-frames.

We can also query the AsyncStackRoot's 'nextStackRoot' member and if it's not null then look at its StackFrame pointer to see where we should stop walking the StackFrame chain and switch to walking AsyncStackFrame chain, the first frame of which is pointed to by the AsyncStackRoot's topFrame member.

That’s a lot of detail, but we hope this gives some understanding of the mechanism we’ve implemented in folly for async stack traces and how to walk them.

In the final post in the series, we’ll briefly describe the infrastructure we use, and provide, to hook into this logic which you can benefit from as a user of folly’s coroutine library.

Other Blogs in this Series

The first post gives a high level background.
The second post discusses the differences between synchronous and asynchronous stack traces and the technical challenges of implementing traces on top of C++20 coroutines.
The third post shows how we tie async stack frames together into a chain.
The final post in the series is about integration work we have done to add the folly support discussed above: crash traces, on-demand printing, gdb integration and exception tracing.

To learn more about Facebook Open Source, visit our open source site, subscribe to our YouTube channel, or follow us on Twitter and Facebook.

Interested in working with open source technologies at Facebook? Check out our open source-related job postings on our career page.

[1] A potential performance problem because compilers on certain architectures can sometimes make use of dedicated registers (e.g. the fs: segment register on Intel architectures) which hold a pointer to the thread data and can use this to make access to thread-local data as fast as access to a global variable - basically read/write to the address at a constant offset from this register.

Tags:

Developer Tools Open Source 2021 Open Source Guides

More News

Async Stack Traces for C++ Coroutines in Folly: Walking the async stack

Finding the top async stack-frame

Finding the stack-frame that corresponds to an async-frame activation

Switching between normal and async stack traces

Other Blogs in this Series

Async stack traces in folly: Improving debugging in the developer lifecycle

Async stack traces in folly: Forming an async stack from individual frames

Async stack traces in folly: Synchronous and asynchronous stack traces