For a global tech platform for luxury fashion, failure and downtime are not an option. Manuel Garcia, Senior Principal Engineer at farfetch.com, explains how he and his team of more than 130 engineers use observability to get ready for Black Friday.

FARFETCH, like most retail businesses, is already gearing up for one of the biggest shopping days of the year. As the leading global technology platform for the luxury fashion industry, it has to deliver a seamless customer experience, and Black Friday is no exception.

On this day, millions of customers scour their favorite sites for the best deals on exclusive products. In 2020, online shopping on Black Friday surged by 22% to a record $9 billion, according to Adobe Analytics—and farfetch.com traffic tripled.

As Senior Principal Engineer, I support the engineering domain responsible for farfetch.com. I'm tasked with preparing for the sales season, which means analyzing and planning every action needed to prevent severe failures or downtime. Here's how my team uses observability to win on Black Friday.

Start below the glass

At farfetch.com, everything our customers experience on Singles’ Day, Black Friday, and Cyber Monday—whether it’s on desktop, browser, or mobile phone—is below the glass. We look at everything through that critical interface.

During this busy time, my engineering team works 24/7 to proactively find and fix problems before customers are impacted. This means architecting our software, infrastructure, and cloud investments to scale while providing the efficiency, uptime, and performance crucial to strong customer experience. 

For farfetch.com, it’s about having robust software at the core of everything we do. And it’s not without challenges. To prepare our site for triple the traffic during the holiday season, we need team collaboration and, most importantly, the insight to adapt our infrastructure as we add new services and capabilities. We need to deliver business growth without compromising on user experience.

As a team, we are ambitious. We want to do better than last year; we want to improve and exceed expectations. New Relic is crucial to achieving this and to ensuring that we have a successful sales season.

Plan, test, observe, repeat

Built on a microservice architecture, farfetch.com is part of a complex system. This type of architecture relies on a network of hundreds of microservices, new capabilities, and input/output work to ensure our customers get the best experience on Black Friday. But these downstream services often use different technologies, are built by different people, and behave differently under high load.

To add to the complexity, we manage huge volumes of data across multiple geographical locations and different data stores. This data needs to be processed, understood, and acted on quickly so that our platform can pivot based on traffic, customer behavior, and business needs.

To combat some of these problems, we built our platform on Microsoft Azure, which gives us the agility to scale with Black Friday traffic. We also rotate our teams to monitor every part of the platform 24/7, distributing responsibility across different departments. This gives my engineering teams autonomy while ensuring that every aspect of our site is covered on the big day.

But our approach to Black Friday is not just reactive. A vital element of our preparation begins months before and focuses on testing, testing, and more testing. 

I work across my team to review alerts and timeouts leading up to the day. We make sure our architecture has the fallbacks and redundancy it needs as a Plan B. This contingency plan can be a real lifesaver during a peak day. We conduct fire drills on the platform, execute load tests, look for weaknesses, and create reports for the different microservices with massive help from New Relic.

We also create data in New Relic to monitor our resilience. We can tell when a service is degraded or unavailable, and when circuit breakers open or close. What’s even better is that we can monitor the load we offload onto other services, and see whether we need to switch off features in order to preserve the best possible user experience.
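For readers less familiar with the pattern, here is a minimal Python sketch of a circuit breaker with a fallback. It is an illustration only, not the farfetch.com implementation: the `CircuitBreaker` class, its thresholds, and the `fallback` callable are all hypothetical names chosen for the example.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker: opens after repeated failures,
    then lets a trial call through once a cooldown has passed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # cooldown in seconds
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means "closed"

    @property
    def is_open(self):
        if self.opened_at is None:
            return False
        # Once the cooldown has elapsed, allow a trial call (half-open).
        return self.clock() - self.opened_at < self.reset_timeout

    def call(self, func, fallback):
        # While open, skip the downstream service and degrade gracefully.
        if self.is_open:
            return fallback()
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # A success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In a setup like this, each transition of `is_open` is the kind of event a team could report to its monitoring tool, so that opened and closed breakers show up next to the other service signals.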

We also implement code freezes to stabilize the platform as much as possible. This gives my team a clear view of how the platform behaves on high-traffic days, so we can anticipate and troubleshoot issues before they impact users.

One clear view

Good dashboards and one clear view of our platform are invaluable tools, both for my engineering team and the different stakeholders we report back to. We did a lot of work over the last 24 months to make that possible.

We built an observability map, a programmability platform, to zoom into the performance of different data centers, which gives us a bird’s-eye view. We also created satellite dashboards so that my team has clear, actionable insights for monitoring our microservices and our users' browsers on Black Friday. Dashboards also allow our main stakeholders to see how our site dealt with periods of high traffic, high revenue, and high risk, like Black Friday.

Observability and New Relic are embedded in farfetch.com and in how we manage our platform. They allow us to follow important signals, such as response time and error rate, and give all of us a bird’s-eye view of what’s going on across multiple microservices and key markets.
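As a rough illustration of those two signals, the Python sketch below computes an error rate and a 95th-percentile response time from a batch of request samples. The `summarize` function, the `(duration_ms, status_code)` sample shape, and the choice to count only 5xx responses as errors are assumptions made for this example; an observability tool would normally derive these figures for you.

```python
import math


def summarize(samples):
    """Summarize a list of (duration_ms, status_code) request samples."""
    if not samples:
        return {"error_rate": 0.0, "p95_ms": None}
    durations = sorted(duration for duration, _ in samples)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for _, status in samples if status >= 500)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    rank = math.ceil(95 * len(durations) / 100)
    return {
        "error_rate": errors / len(samples),
        "p95_ms": durations[rank - 1],
    }


if __name__ == "__main__":
    batch = [(100, 200)] * 18 + [(900, 500), (1200, 503)]
    print(summarize(batch))
```

A percentile is used rather than an average because a handful of very slow requests can hide behind a healthy-looking mean.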

New Relic is a tool that my engineering team uses every day. Not just on Singles’ Day, Black Friday, and Cyber Monday.