Gett Solves Incident Management Challenges with New Relic One
Gett is a leading ground travel platform for businesses, and its advanced technology makes business ground travel simpler, safer, and more efficient. The company’s mobility software is transforming the way businesses move their teams by combining clients’ preferred ride-hailing apps and car services onto a single SaaS platform. In 2010 Gett launched one of the first-ever on-demand B2B mobility services, and the company’s software powers, among other clients, a third of the Fortune 500.
Rapid incident management is critical to the smooth running of Gett’s international operations. Understanding and fixing performance problems affecting drivers and riders is complicated by a dynamic microservices cloud architecture, but New Relic observability is helping a talented team more than meet the challenge and deliver a superior digital experience.
Meeting 99% SLAs
One of Gett’s major challenges is keeping its technology reliable and available to drivers and riders more than 99% of the time, especially when the business experiences unexpected spikes in traffic. In these scenarios, it is crucial that the research and development team, which includes tech support and incident management, works closely with customer care on how technology is developed, deployed, and monitored, and on the impact that has on drivers and riders. Getting a comprehensive and swift understanding of how everyone is experiencing the service in real time is paramount. But five years ago, Gett had neither a proper tech support team nor a precise incident management process with the right monitoring tools in place.
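To put the 99% figure in perspective, here is a minimal sketch that converts an availability target into a downtime budget. The function name and the 30-day window are illustrative assumptions, not the terms of Gett's actual SLA.

```python
def downtime_budget_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period for a given availability target.

    Illustrative helper only; the measurement window is an assumption.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


# A 99% target over 30 days leaves about 432 minutes (7.2 hours) of downtime.
budget = downtime_budget_minutes(99.0)
```

The budget shrinks quickly as the target rises, which is why fast detection matters: every minute an incident goes unnoticed is spent from this allowance.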
As Dani Konstantinovski, global tech support manager at Gett, recalls: "When there was a problem, the first we used to hear of it was from the field. Drivers used to call our customer care team who then called us. It simply wasn't the best way to deal with putting out the fires."
"In my book, that’s a failure," adds Lior Avni, global incident manager, who works closely with Konstantinovski whenever there’s a critical incident that needs to be escalated.
Since then, Gett has invested significant time and resources to ensure they deliver a superb customer experience. "We had many challenges before, which we dealt with over the years: organization, mapping of services, missing alerts, things like that. One by one we took care of everything," explains Avni. "So right now, the only two challenges are shortening the mean time to understand, and mean time to detect."
When the team works 24/7, reducing mean time to resolve (MTTR) is critical, but the size of the production environment presents real challenges, as Konstantinovski reveals: "We are working with a microservices architecture with close to 200 microservices in our system. When something goes down, there’s usually a butterfly effect and chain reaction, and we need to find the source quickly to put out what we at Gett term as 'fires.'"
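The chain-reaction problem can be illustrated with a small dependency-graph sketch. It assumes a hypothetical map of which services call which, plus the set of services currently alerting, and flags the alerting services whose own dependencies are all healthy. The service names and the function are made up for illustration; this is not how New Relic or Gett implement root-cause analysis.

```python
from typing import Dict, Iterable, List, Set


def likely_sources(dependencies: Dict[str, Iterable[str]], unhealthy: Set[str]) -> List[str]:
    """Return unhealthy services that have no unhealthy dependency.

    `dependencies` maps a service to the services it calls. A failing service
    whose dependencies are all healthy is a plausible origin of the cascade.
    """
    return sorted(
        svc
        for svc in unhealthy
        if not any(dep in unhealthy for dep in dependencies.get(svc, ()))
    )


# Hypothetical topology: rider-app -> dispatch -> {pricing, geo}
deps = {
    "rider-app": ["dispatch"],
    "dispatch": ["pricing", "geo"],
    "pricing": [],
    "geo": [],
}
alerting = {"rider-app", "dispatch", "geo"}
sources = likely_sources(deps, alerting)  # ["geo"]
```

Three services are alerting here, but only `geo` fails without a failing dependency, so it is the first place to look. A cycle of failing services would return no candidates, so a real system would need more than this heuristic.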
"The breadth, length and width of our production system keeps on growing," adds Avni. "The challenge is to monitor so many services and machines and get the work done in an organized way."
Answering the microservices observability challenge
As a major Amazon Web Services (AWS) user, Gett previously had access to various monitoring tools. Although helpful, these tools fell short of what was needed. The need for full, real-time observability across its microservices drove Gett to choose New Relic, and then expand its use, to improve incident management and deliver strong digital customer experiences.
"New Relic makes our lives much, much easier. We can precisely identify the problem by jumping into New Relic to understand exactly what service is affected, what’s the reason, and what we need to do. With microservices, when one service is going down, you need to understand exactly what and where it is impacting. New Relic gives us this observability—and without it, this job would be very, very hard to do," says Konstantinovski.
Because full observability is so important to how Gett manages its services, the team has expanded its use of New Relic beyond application monitoring to include its logging capabilities. Using New Relic to consolidate monitoring tools streamlines how the team understands issues and saves on costs, explains Konstantinovski: "We were using logging tools like ELK that we were finding complex to use. So, adding logs into New Relic was a wow moment, because for the first time, we have everything in the same system, making it so much easier to understand and to identify problems."
"Not needing to switch between tools saves us valuable minutes when we’re managing an incident, which helps us reduce our mean time to understand," adds Avni. "New Relic is now my No. 1 tool for incident management, both in how it helps manage every service and creates the notification channels to be directed to the actual engineer who owns the service. Fine-tuning alerts minimize mean time to detect from five to under two minutes, due to those specific alerts from New Relic."
A single source of truth for teams on the frontline
Managing a service used by millions daily has to take into account sudden changes in customer demand driven by unexpected events, which means the team must be extremely responsive to how technology is performing under pressure. "While we can prepare for major events like we did for the World Cup in Russia, there’s a second kind of spike like an extreme storm or one of our competitors having technical problems that lead more customers to unexpectedly choose to use our app. New Relic helps us to see exactly where these huge increases are building up and where we must add machines and capacity," explains Konstantinovski.
The speed and precision of observability change the dynamics of how these incidents are managed, allowing Gett to be more proactive in its response. "Clearer, earlier visibility of problems means that when drivers call our customer care team, they are already prepared for this call and can confidently tell them we are dealing with it. For me, a huge advantage of using New Relic is how it helps us manage the resolution of end-to-end incidents and problems together," explains Konstantinovski.
Having a comprehensive single source of truth with New Relic enables the incident management and development teams to collaborate much more closely on their prime objective of delivering an excellent digital experience. As Lena Katz, head of R&D, says: "At Gett, the developers in R&D are the owners of their services. We want them to be happy developers and don’t want to have pagers going off when it's not necessary. Because with New Relic we can see so clearly across all of our 200 or more microservices, we know exactly what’s going on and which microservice may have started the fire. So, we are able to alert the right development team."
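The idea of alerting only the team that owns a failing service can be sketched as a simple ownership lookup. Everything below (the `Alert` type, the ownership map, the channel names) is hypothetical and only illustrates the routing concept; it is not New Relic's notification API.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Alert:
    service: str
    message: str


# Hypothetical ownership map: service name -> owning team's channel.
OWNERS = {
    "pricing": "team-pricing",
    "dispatch": "team-dispatch",
}
# Assumed fallback so that unowned services still reach a human.
FALLBACK_CHANNEL = "incident-management"


def route(alert: Alert, owners: Mapping[str, str] = OWNERS) -> str:
    """Pick the channel for an alert: the owning team if known, else the fallback."""
    return owners.get(alert.service, FALLBACK_CHANNEL)


channel = route(Alert(service="pricing", message="error rate above threshold"))
```

Routing by ownership means only one team is paged per failing service, and the fallback keeps alerts from unmapped services from being dropped.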
This collaboration is important because the members of the tech support team are not technology specialists, so the clear guidance New Relic gives them matters a great deal. As Konstantinovski explains: "New Relic helps us to identify the problem exactly. Not only which microservice has a problem, but also which specific error caused this service to have this issue. So, when we contact the developers, we send a link to the specific error so they can fix the problem much faster."
The developers themselves are won over by how New Relic helps them as they go into production. "I always tell my R&D engineers, new features are nice, but if your legacy doesn't work, who cares about new features?" Avni says. "Customer experience needs to be first grade. This is the only thing that matters. It's not the 'everything,' it's the 'only thing.' New Relic dashboards help tremendously. When I show them, it is an 'aha' moment for them, because they have a complete picture and, with a click of a button, can see the most troublesome transaction to fix."
Gett has built a strong collaborative team with well-thought-out incident management processes and has made New Relic central to how everyone works together. As a result, MTTR has been reduced by 50%.
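For readers less familiar with these metrics, the sketch below shows one common way to compute mean time to detect and mean time to resolve from incident records. The records are made up for illustration, and measuring both from the moment the incident started is an assumption; teams define these intervals differently.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average gap in minutes between two timestamps across incidents."""
    return mean(
        (inc[end_key] - inc[start_key]).total_seconds() / 60 for inc in incidents
    )


# Made-up incident records, not Gett data.
t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    {"started": t0, "detected": t0 + timedelta(minutes=1), "resolved": t0 + timedelta(minutes=31)},
    {"started": t0, "detected": t0 + timedelta(minutes=3), "resolved": t0 + timedelta(minutes=49)},
]

mttd = mean_minutes(incidents, "started", "detected")  # 2.0 minutes
mttr = mean_minutes(incidents, "started", "resolved")  # 40.0 minutes
```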
As Konstantinovski says: "You open up New Relic, and instantly you can understand where you have a problem, what's the business impact based on the microservices that have some issues. You can understand what is going to be an impact on our customer care and our clients. That helps us all work together on managing the incident smoothly."
For Katz, it is even more fundamental to how technology enables the business. "We are committed to serve our customers with the best SLA of 99%, so we need to be able to detect our issues as soon as possible and to resolve them on the spot. That's why we have invested in New Relic to observe our applications because it gives us the ability to understand what's going on at all times with our services."