
Uncovering the Unknown Unknown

May 16, 2023 · By Andrew Reedy

Meta has an ecosystem of products used by billions of people every day, whether for social connection or for business engagement. Those billions of users also expect us to deliver new functionality and improvements continually, and at an accelerated pace. To meet those expectations across such a large user base, we ship thousands of code changes each day.

Continuously releasing such a large number of product improvements creates the potential for application health regressions that can lead to user experience degradation. At Meta, we have a sophisticated layering of prevention tools to guard against the release of faulty product changes. These prevention tools monitor various product signals and detect regressions prior to the release of apps. The three areas these signals fall into are performance, reliability (functional and systemic), and efficiency. These are collectively referred to as the “PRE” of the products. Prevention starts with various observability and analytical traces collected while the developers are authoring code. In addition, observability and monitoring tools collect data from internal usage of the products in pre-production environments. Both manual and automated quality assurance tools test the products in those pre-production environments.

While these mechanisms exist to assess the quality of code changes before they land in production, and therefore to avoid user impact, user-impacting changes occasionally do make it into production and need to be reworked and landed once again. To limit the possibility of an issue affecting a large population, we use various mechanisms to release changes to the public incrementally.

When there is a departure from the functionality that a user expects, we define that as a regression. We use the word signal to indicate that users are letting us know that something is not meeting their expectations. We collect regression signals as early as our systems allow, through data insights and instrumented code, and then work to reduce user impact by deploying fixes. When we don't get a strong signal from our tooling, we rely on user reports as the signal that a regression has taken place. These reports come through various mechanisms, such as users shaking their phones to report a problem in the context of where they are in the application when the problem occurs.

To ensure that we meet the needs of our customers as quickly as possible, sometimes within the same day, Meta has invested heavily in a number of reliability programs and systems that enable significant responsiveness at scale.

Signals can come from our public users (individuals, companies, groups, etc.) or from internal users. We carefully scrutinize the signals from internal users to help prevent unwelcome changes from landing in production. Note that signals are aggregates; the number of bugs per million users is one example. Signals can also come from automated testing tools and scenarios, development tools (we warn developers of changes that could cause runtime issues), system logs, and many other sources. We also have tools that harness these signals to quickly pinpoint causes and appropriate fixes.
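To make the idea of an aggregate signal concrete, here is a minimal sketch, with hypothetical names and thresholds rather than Meta's actual tooling, that turns raw report counts into a bugs-per-million-users rate and flags a surface when that rate crosses a tuned threshold.

```python
from dataclasses import dataclass

@dataclass
class SurfaceStats:
    """Raw counts for one product surface over a fixed window (hypothetical)."""
    surface: str
    bug_reports: int
    active_users: int

def bugs_per_million(stats: SurfaceStats) -> float:
    """Normalize raw report counts into an aggregate rate signal."""
    if stats.active_users == 0:
        return 0.0
    return stats.bug_reports / stats.active_users * 1_000_000

# Hypothetical threshold: in practice values would be tuned per surface.
ALERT_THRESHOLD = 50.0

for stats in [
    SurfaceStats("feed", bug_reports=120, active_users=3_000_000),
    SurfaceStats("reels", bug_reports=900, active_users=4_000_000),
]:
    rate = bugs_per_million(stats)
    if rate > ALERT_THRESHOLD:
        print(f"{stats.surface}: {rate:.1f} bugs per million users -> investigate")
    else:
        print(f"{stats.surface}: {rate:.1f} bugs per million users -> within normal range")
```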

Once a regression is identified, we proceed to quickly find the relevant engineers to apply fixes and release them. Depending on the complexity of the regression, we employ a number of methods to pinpoint the right engineers, including leveraging Meta's expertise in machine learning. This is a continual process: as we release new features, we retrain our models to keep up with the constant change.

And yes, there is an endgame: We progress from detecting regressions, to mitigating them, to preventing them and sustainably containing them. In the process, we maximize responsiveness, return on investment, and efficiency.

In the next section, we’ll discuss in more detail how we establish with certainty that a regression has occurred, and knowing that, how we respond.

Translating Unknowns to Knowns & Quantifying Them

When we speak of unknowns, we are referring to a bug that is being experienced by a user or system. We might first learn about a possible bug from a number of different sources.

  • Synthetic testing of the proper functioning of our applications and specific products within those applications.
  • Internal users who are testing to find bugs. This can include our dedicated test teams, or the broader user base within Meta.
  • External users who report an issue through the problem report dialog in our apps.

These bugs are unknowns until they are actually flagged as incidents being experienced by users. When the number of such bug reports reaches a certain threshold, we consider them a true positive, representing a regression event: a code change has created a new bug and we have to proceed quickly to mitigate. There are two major types of reports:

  • “Known unknowns,” when instrumented code detects and reports the regression; we have a number of tools that help produce this signal.
  • “Unknown unknowns,” when the regression is not caught internally and we rely on users to actually report it.

Additionally, reported issues fall into several categories, such as code bugs, spam, or other violations of content policy. Our focus is on the first category, where we want to address code-related user experience issues. How do we do that when users report thousands of issues?

Producing Signals and Insights

To manage the large volume of reports, we generate specific signals that flag issues for us. The first type of signal is one that, with properly tuned thresholds, tells us that an incident or regression has indeed occurred. Note that multiple events may have occurred depending on the type of regression; for example, a regression might affect multiple products because of a common code base. We look for such commonalities and address them as a single incident to optimally align resources. Visually, such a regression shows up as a departure from the trendline of report volume.
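The detector itself is not public, so as a hedge, the sketch below uses a simple rolling-baseline rule (mean plus a few standard deviations) as a stand-in for a properly tuned threshold that flags a departure from the trendline of report volume.

```python
import statistics

def departs_from_trend(history: list[int], current: int, k: float = 3.0) -> bool:
    """Flag the current report count if it sits well above the recent baseline.

    A mean + k * standard deviation rule stands in for whatever tuned
    detector a production system would actually use.
    """
    baseline = statistics.mean(history)
    spread = statistics.pstdev(history) or 1.0  # avoid a zero threshold on flat history
    return current > baseline + k * spread

# Hypothetical hourly report counts for one product surface.
recent_hours = [38, 41, 40, 44, 39, 42, 40, 43]
print(departs_from_trend(recent_hours, current=45))   # False: normal fluctuation
print(departs_from_trend(recent_hours, current=120))  # True: likely regression event
```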

We then collect additional signals, some of which indicate the type of regression we may be facing. These signals provide insight into the symptoms associated with the regression, and are generated by machine learning algorithms that are tuned for particular Meta products. The algorithms analyze the user reports to determine what caused the regression and hence which engineering team can best address it by deploying a code fix or other measure.
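The models themselves are Meta-internal, so the sketch below only illustrates the routing idea: a keyword score stands in for a trained classifier that maps a free-text report to the team best placed to address it.

```python
# Hypothetical keyword-based router standing in for a trained text classifier.
TEAM_KEYWORDS = {
    "media_team": {"video", "playback", "upload", "photo"},
    "messaging_team": {"message", "send", "notification", "chat"},
    "feed_team": {"feed", "scroll", "post", "refresh"},
}

def route_report(report_text: str) -> str:
    """Pick the team whose keywords best match the report; fall back to triage."""
    words = set(report_text.lower().split())
    scores = {team: len(words & keywords) for team, keywords in TEAM_KEYWORDS.items()}
    best_team, best_score = max(scores.items(), key=lambda item: item[1])
    return best_team if best_score > 0 else "triage_queue"

print(route_report("Video playback stops when I scroll the feed"))
```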

Many other signals are generated from log data that has been collected. These diagnostic logs can be client-side or server-side logs that are collected, with user consent, to help determine the cause. Applications incorporate proper consent flows to determine whether this type of data can be collected based on the user's permissions. We also limit how long this data is retained to comply with applicable regulations.
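As a rough illustration of a consent and retention gate (the field names and the 30-day window are assumptions, not Meta policy), a diagnostic log would only be kept when the user has opted in and only for a bounded period.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention window; real values depend on applicable regulations.
RETENTION = timedelta(days=30)

def should_keep(log_entry: dict, user_consented: bool) -> bool:
    """Keep a diagnostic log only with consent and only within the retention window."""
    if not user_consented:
        return False
    age = datetime.now(timezone.utc) - log_entry["collected_at"]
    return age <= RETENTION

old_entry = {"collected_at": datetime.now(timezone.utc) - timedelta(days=45)}
new_entry = {"collected_at": datetime.now(timezone.utc)}
print(should_keep(old_entry, user_consented=True))    # False: past the retention window
print(should_keep(new_entry, user_consented=False))   # False: no consent given
```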

The combination of signals enables us to home in on the regression and assign it to the correct product team to quickly mitigate the issue and restore the user’s experience to what they expect.

Insights are also derived from the findings in aggregate. We look at the data to find common root causes and determine whether changes need to be made to prevent the regressions from recurring. Doing so is a key aspect of a robust approach to addressing user experience issues, as it helps us minimize the regression surface over time.

Our mobile applications not only include navigation paths that allow the user to report a problem, but also provide the ability to indicate a problem by shaking the device. That way, in-context telemetry can be collected. That telemetry is further enriched via Black Box signals coming from the app. Black Box is like a flight recorder which automatically captures certain logs (as consented to by users) about what was going on with the application at the time the bug was reported so that we can better diagnose the problem.
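Black Box itself is internal, so the sketch below simply illustrates the flight-recorder idea: a small in-memory ring buffer keeps only the most recent application events and is snapshotted when the user reports a problem.

```python
from collections import deque
import time

class FlightRecorder:
    """Keep only the most recent app events, like a flight recorder (illustrative only)."""

    def __init__(self, capacity: int = 1000):
        self._events = deque(maxlen=capacity)  # old events fall off automatically

    def record(self, event: str) -> None:
        self._events.append((time.time(), event))

    def snapshot(self) -> list:
        """Called when the user reports a problem, e.g., by shaking the device."""
        return list(self._events)

recorder = FlightRecorder(capacity=3)
for event in ["app_start", "open_feed", "play_video", "video_stall"]:
    recorder.record(event)

# Only the last three events remain; with consent, these would be attached to the report.
print([name for _, name in recorder.snapshot()])
```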

All this telemetry is stitched into a set of timeline views to help engineering teams determine what caused the issue. Due to the rapid release cadence of our apps, as well as changes in server-side application components, pinpointing the issue would be difficult without this capability. Having this telemetry over time helps us detect patterns in the regressions to prevent them in the future.
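As an illustration of stitching telemetry into a timeline view (the event sources and fields here are made up), the sketch below merges client-side and server-side events by timestamp so an engineer can read the sequence of events leading up to a report.

```python
# Hypothetical client-side and server-side events around the time of a report.
client_events = [
    (12.0, "client", "tap_reel"),
    (12.4, "client", "render_error"),
]
server_events = [
    (11.8, "server", "fetch_reel_metadata"),
    (12.2, "server", "media_service_timeout"),
]

# Merge all sources into one timeline ordered by timestamp.
timeline = sorted(client_events + server_events, key=lambda event: event[0])
for ts, source, name in timeline:
    print(f"t+{ts:>5.1f}s  [{source:6}] {name}")
```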

Scaling to Smaller Product Surfaces

As we have improved the precision of our signals, we are able to attack more of the regression surface, parts of which would previously have gone undetected. A surface represents a particular offering within a product, such as News Feed or Reels. This is a challenge because products not only have many granular features, but new features are also introduced regularly. Hence, not only is the regression surface large, but it also grows over time, and our defensive and offensive strategies must continually adapt.

In order to scale, there are several conditions that need to be met:

  1. We must have precise coverage and proper classification of existing product surfaces
  2. We must reduce the regression surface for existing products (read: prevention)
  3. We must expand the reach of our algorithms to new product surfaces

Without fulfilling the above conditions, we risk having a bloated system that cannot be managed due to size, complexity, and inaccuracy. So our efforts are about maintaining such a balance while also efficiently running our day-to-day operations.

Moving Beyond Bugs

As we continue to reduce the regression surface (without letting our guard down), we increase our focus in other important areas: prevention and usability.

Prevention is concerned with:

  1. How do we learn from prior regressions?
  2. What defenses do we implement such that these regressions do not occur again?
  3. How do we ensure that, over time, the regression surface keeps decreasing?

As discussed before, we aggregate data from prior regressions to learn from them. Those learnings may translate into:

  1. Placing markers in our code that alert us, while a change is still in a pre-production environment, so that we can stop an undesirable change from progressing into production (a minimal sketch follows this list)
  2. Increasing test coverage at various levels to detect such regressions
  3. Changing the product itself to address these issues
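As a sketch of the first idea, a marker placed near a known-fragile code path might fail loudly in pre-production builds while staying quiet in production; the build flag and helper name here are hypothetical, not Meta's actual tooling.

```python
import os

# Hypothetical build flag; internal/pre-production builds would set BUILD_CHANNEL.
IS_PREPRODUCTION = os.environ.get("BUILD_CHANNEL", "production") != "production"

def regression_marker(condition: bool, message: str) -> None:
    """A marker placed near a code path that caused a regression in the past.

    In pre-production builds it fails loudly, blocking the change from
    progressing; in production it would only log (logging omitted here).
    """
    if not condition and IS_PREPRODUCTION:
        raise AssertionError(f"Regression marker tripped: {message}")

# Example: a past regression taught us this list must never be empty here.
visible_posts = ["post_1", "post_2"]
regression_marker(len(visible_posts) > 0, "feed rendered with zero posts")
```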

Prevention helps us to keep shifting regressions further left in the development lifecycle and hence to keep reducing costs. This is done by getting sufficiently rich signals earlier in the development process, beginning with code authoring and pre-production release cycles. This not only improves the user experience but also cuts development costs and helps us divert resources to new product development.

Improving on our ability to handle regressions more efficiently enables us to focus more on user confusion or usability issues. These are higher value concerns as they are focused on improving products as opposed to fixing them. Over time, the proportion of usability issues to regressions should increase, and the total number of usability issues and regressions should decline.

Ultimately, this results in a user experience which is continually improving along both the functionality and usability dimensions.

This article was written in collaboration with Bijan Marjan, a Technical Program Manager at Meta.