Building Resilient Systems from the Customer’s Perspective

As customer expectations for fast, seamless, and always-available digital experiences continue to grow, it’s increasingly important to measure availability through the lens of the customer rather than individual applications. Traditional, more siloed approaches to measuring availability can fail to capture the full extent of how customers experience service disruptions or system latency. Resiliency is in the eyes of the user, so it’s important to orient observability around the customer experience and prioritize continuous improvement efforts geared toward the most critical customer journeys.

Historically, our teams followed the standard industry practice of instrumenting each system individually across its own technology stack. While this approach is useful for identifying and diagnosing issues that originate from within a single system, it struggles to capture inter-system dependencies or the full path of a customer journey. Under this standard approach, troubleshooting customer-reported issues often requires coordination across multiple teams and can lead to longer resolution times.

To shift toward a more customer-centric approach, we set out to define, map, measure, and test end-to-end customer journeys. These journeys spanned multiple personas, including customers, merchants, and even internal colleagues. With this approach, the focus shifted from the success of an application to the success of the entire customer journey.

Our path to customer journey-oriented resiliency:

Define: We started by identifying a set of critical customer journeys across the company, defined as intents, such as “I want to pay my bill” or “I want to apply for a card product.” We then further split these journeys into tiers of criticality to allow us to set availability targets accordingly.
Map: This step was crucial to the overall accuracy and effectiveness of our journey approach and involved mapping each customer journey to the underlying applications, databases, third-party systems, and other system components that support it. These detailed journey topologies helped identify dependencies, risks, key internal stakeholders, and single points of failure, while also enabling better correlation of incidents and allowing us to detect issues before they impacted customers.
Measure: Availability reporting became the cornerstone of our customer journey-centric approach. To ensure accurate end-to-end measurement, we adopted an enterprise-wide logging framework and standard telemetry to enable traceability and transaction correlation across systems. This new monitoring allows us to observe attempted transactions, out-of-pattern traffic volumes, and business exceptions, which enables us to more quickly identify potential customer impacts that might otherwise go unnoticed through regular downtime monitoring approaches.
Test: While testing individual system failover was already a standard practice in our Disaster Recovery program, we expanded this requirement to test all systems linked to a single customer journey at the same time. These end-to-end tests involve failing over entire journeys, often composed of dozens of interconnected systems, for multiple days. These exercises provide valuable insights into system resiliency, allowing us to uncover bottlenecks, hidden dependencies, and system latency that may not have been uncovered through isolated failover testing. By mimicking real-world scenarios, we are able to test whether an entire customer journey can withstand unexpected disruptions.
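
To make the Map and Measure steps more concrete, here is a minimal Python sketch of how a journey topology and correlated telemetry could be combined into a journey-level availability figure. It is an illustration only, not a description of our actual tooling: the journey names, system names, the HopRecord type, and the functions below are all hypothetical, and it assumes every log record carries a correlation ID that ties it to a single customer attempt.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HopRecord:
    """One telemetry record: a single system handling part of one customer attempt.

    Hypothetical shape; assumes a shared correlation ID across systems.
    """
    correlation_id: str
    journey: str
    system: str
    ok: bool


# Hypothetical journey topology (Map step): journey -> supporting systems.
JOURNEY_TOPOLOGY = {
    "pay_my_bill": ["web_gateway", "billing_api", "payment_processor"],
    "apply_for_card": ["web_gateway", "application_service", "credit_bureau"],
}

# Toy telemetry for four "pay_my_bill" attempts (a, b, c, d).
RECORDS = [
    HopRecord("a", "pay_my_bill", "web_gateway", True),
    HopRecord("a", "pay_my_bill", "billing_api", True),
    HopRecord("a", "pay_my_bill", "payment_processor", True),
    HopRecord("b", "pay_my_bill", "web_gateway", True),
    HopRecord("b", "pay_my_bill", "billing_api", False),
    HopRecord("c", "pay_my_bill", "web_gateway", True),
    HopRecord("c", "pay_my_bill", "billing_api", True),
    HopRecord("c", "pay_my_bill", "payment_processor", False),
    HopRecord("d", "pay_my_bill", "web_gateway", True),
    HopRecord("d", "pay_my_bill", "billing_api", True),
    HopRecord("d", "pay_my_bill", "payment_processor", True),
]


def shared_dependencies(topology):
    """Return systems that support more than one journey, with those journeys."""
    journeys_by_system = defaultdict(set)
    for journey, systems in topology.items():
        for system in systems:
            journeys_by_system[system].add(journey)
    return {
        system: sorted(journeys)
        for system, journeys in journeys_by_system.items()
        if len(journeys) > 1
    }


def system_availability(records, system) -> Optional[float]:
    """Siloed view: share of successful hops observed on one system."""
    hops = [r.ok for r in records if r.system == system]
    return sum(hops) / len(hops) if hops else None


def journey_availability(records, journey) -> Optional[float]:
    """Customer view: share of attempts in which every observed hop succeeded."""
    outcome = {}
    for r in records:
        if r.journey == journey:
            # An attempt counts as successful only if all of its hops were ok.
            outcome[r.correlation_id] = outcome.get(r.correlation_id, True) and r.ok
    return sum(outcome.values()) / len(outcome) if outcome else None


if __name__ == "__main__":
    print(shared_dependencies(JOURNEY_TOPOLOGY))
    print(system_availability(RECORDS, "billing_api"))
    print(journey_availability(RECORDS, "pay_my_bill"))
```

In this toy data, billing_api on its own shows 0.75 availability and web_gateway shows 1.0, yet only two of the four customer attempts completed, so the journey-level figure is 0.5. That gap between per-system and per-journey numbers is the kind of impact that siloed monitoring can miss.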

In making changes to support this customer-centric approach, it was critical to ensure that we brought stakeholders together from across the organization to support the program. We needed both top-level leadership support and widespread buy-in from development and support teams, and we achieved this through constant collaboration and intentional expansion of customer journeys. What started with a handful of critical customer journeys has now expanded to over 65 journeys spanning every line of business.

In addition to achieving continuous year-over-year improvements to our customer journey availability, one of the most positive outcomes of this program was the cultural shift of bringing end-to-end teams together, creating shared goals, and driving value collectively. This cross-functional approach enabled best practice sharing, streamlined communication, and strengthened our overall resiliency posture beyond our defined journeys and into how we build and develop software every day.