Distributed tracing is a powerful diagnostic tool for hybrid and microservices-based environments, because you can investigate performance issues from one place. A distributed trace consolidates records of events that take place across components of a distributed system. 

In this article, you'll learn:

  • What distributed tracing is, and how to use it
  • The structure of distributed traces, including spans and transactions, and examples in New Relic
  • How to pass trace context between services, including the W3C Trace Context Standard
  • The pros and cons of head-based and tail-based trace sampling
  • The benefits of distributed tracing

What is distributed tracing?

A distributed trace consolidates records of events that take place across components of a distributed system. These events are triggered by a single operation, such as clicking a button on a website, and they cross process, network, and security boundaries. To gain an intuitive understanding of distributed tracing, let’s define each term:

  • Distributed refers to distributed systems, which consist of independent components that communicate through requests to form an application. 
  • Tracing refers to traces, which track the end-to-end path of a request as it travels from service to service.

Distributed tracing is an essential part of a unified application performance monitoring (APM) platform. It provides real-time visibility into the health and performance of your entire application stack when you integrate it with other observability tools, such as metrics, logs, and alerts. Distributed tracing provides two core pieces of information:

  • The path a service request takes across a distributed system
  • The time spent to complete each service request

When you’re monitoring microservices-based architectures, distributed tracing helps pinpoint where failures occur and what causes poor performance. Here's an illustration of how distributed tracing works in New Relic:

While the traces themselves contain all the relevant data for conducting root cause analysis, tracing tools differentiate themselves based on their capabilities for:

  • Ease of deployment and instrumentation
  • Visualization and querying
  • Configuration and flexibility

The structure of distributed traces

In New Relic, distributed traces gather three types of data:

  • A span is a named, timed operation that represents a piece of the workflow. Examples of span operations include datastore queries, browser-side interactions, method-level time tracking, calls to other services, and also Lambda functions. For example, in an HTTP service, you might want a span created at the beginning of an HTTP request and completed when the HTTP server returns a response. Span attributes contain important information about the operation such as duration and host data.
  • A transaction is a logical unit of work in a software application, such as HTTP requests, SQL queries, background processes, message queue activity, and so on. In New Relic, the transaction event includes information about the app, database calls, the duration of the transaction, and any errors that occur.
  • Contextual metadata shows calculations about a trace and the relationships between its spans. It also shows the duration of traces, all entities that are part of a trace, the number of entities that are part of a trace, the trace's start time in milliseconds, as well as the parent/child IDs that represent all of the span relationships within a trace.
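
To make the contextual metadata concrete, here is a minimal Python sketch that derives it from a list of spans. The Span class, its field names, and the sample services are assumptions made for illustration; they are not New Relic's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical span record; field names are illustrative only."""
    span_id: str
    parent_id: Optional[str]  # None for the root span
    entity: str               # service or app that emitted the span
    name: str
    start_ms: int             # start time in milliseconds
    duration_ms: int


def trace_metadata(spans):
    """Derive trace-level metadata from the spans of a single trace."""
    start = min(s.start_ms for s in spans)
    end = max(s.start_ms + s.duration_ms for s in spans)
    entities = sorted({s.entity for s in spans})
    # Map each parent span ID to the IDs of its child spans.
    children = {}
    for s in spans:
        if s.parent_id is not None:
            children.setdefault(s.parent_id, []).append(s.span_id)
    return {
        "start_ms": start,
        "duration_ms": end - start,
        "entities": entities,
        "entity_count": len(entities),
        "children": children,
    }


spans = [
    Span("a", None, "web-frontend", "GET /checkout", 1_000, 120),
    Span("b", "a", "web-frontend", "call cart-service", 1_010, 90),
    Span("c", "b", "cart-service", "GET /cart", 1_020, 70),
    Span("d", "c", "cart-service", "SELECT cart_items", 1_030, 40),
]
meta = trace_metadata(spans)
print(meta)
```

The parent/child map is what lets a tracing tool draw the trace as a tree, while the other values summarize the trace as a whole.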

More about spans

A span in a distributed trace represents the individual unit of work done and the time a service spends processing a request. Traces encapsulate spans in a tree-like structure: a parent span can have more than one child span.

To understand spans in distributed tracing, you’ll need to know these concepts:

  • Trace duration is a trace's total duration, determined by the length of time from the start of the earliest span to the completion of the last span.
  • A process entry span is the first span in the execution of a logical piece of code, such as a backend service or Lambda function.
  • A process exit span is either the parent of an entry span or, if it has attributes prefixed with http. or db., an external call.
  • An in-process span represents an internal method call or function and is neither an entry span nor an exit span.
  • A client span represents a call to another entity or external dependency. Currently, there are two client span types. First, datastore client spans have attributes prefixed with db., and second, external client spans have attributes prefixed with http. or have a child span in another process.
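
The span types above can be sketched as a small classifier. This is a simplified illustration, not New Relic's logic: the dictionary layout, the "process" field, and the rule that entry spans are checked first are all assumptions.

```python
def classify(span, spans_by_id, children_by_parent):
    """Label a span using the attribute-prefix rules described above."""
    parent = spans_by_id.get(span["parent_id"])
    # Entry span: no parent, or the parent ran in a different process.
    if parent is None or parent["process"] != span["process"]:
        return "entry"
    attrs = span.get("attributes", {})
    if any(key.startswith("db.") for key in attrs):
        return "datastore client"
    has_remote_child = any(
        child["process"] != span["process"]
        for child in children_by_parent.get(span["id"], [])
    )
    if any(key.startswith("http.") for key in attrs) or has_remote_child:
        return "external client"
    return "in-process"


def classify_trace(spans):
    """Classify every span in one trace; returns {span id: label}."""
    spans_by_id = {s["id"]: s for s in spans}
    children_by_parent = {}
    for s in spans:
        children_by_parent.setdefault(s["parent_id"], []).append(s)
    return {s["id"]: classify(s, spans_by_id, children_by_parent) for s in spans}


# Hypothetical trace: a frontend process calls a cart process, which queries a database.
spans = [
    {"id": "a", "parent_id": None, "process": "frontend", "attributes": {}},
    {"id": "b", "parent_id": "a", "process": "frontend", "attributes": {}},
    {"id": "c", "parent_id": "b", "process": "frontend",
     "attributes": {"http.url": "http://cart.example/items"}},
    {"id": "d", "parent_id": "c", "process": "cart", "attributes": {}},
    {"id": "e", "parent_id": "d", "process": "cart",
     "attributes": {"db.statement": "SELECT 1"}},
]
labels = classify_trace(spans)
print(labels)
```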

Here’s an example from How trace data is structured in the New Relic docs:

More about transactions

A transaction is a logical unit of work in a software application. Specifically, it refers to the function calls and method calls that make up that unit of work. In the context of application performance monitoring, it often refers to a web transaction that represents activity starting from when the application receives a web request to when the response is sent.

In her blog post explaining distributed tracing, Erika Arnold describes three main ways distributed tracing uses transactions:

  • Analyzing transactions: Tracing monitors transactions that take place throughout the system to gain insights into its performance. Each transaction plays a role in performance, and underperforming services have a knock-on effect on the rest of the services. 
  • Recording transactions: Tracing helps keep track of large numbers of transactions. Tracing context that comes into a service with a request is propagated to other processes and attached to transaction data. With this context, you can stitch the transactions together later. Since the industry shift from monolithic applications to microservices, it has become increasingly important to track transactions across process boundaries where you can’t install APM agents.
  • Describing transactions: Tracing helps measure transactions, providing information such as what transactions took place and how long they lasted.

Passing trace context between services

Trace context refers to a set of HTTP headers in New Relic that propagate data from one service to another in order to compose end-to-end traces. Monitoring agents add these HTTP headers to a service's outbound requests. The headers identify traces and carry identifying information as requests travel through various networks, processes, and security systems. The data they carry includes:

  • Each trace span has a guid attribute. The guid of the last span within the process is sent with the outgoing request, so that the first segment of work in the receiving service can add this guid as the parentId attribute.
  • The parent type is the source of the trace header, such as mobile, browser, or Ruby app. This becomes the parent.type attribute on the transaction triggered by the request.
  • The timestamp is the UNIX timestamp in milliseconds when the payload was created.
  • The traceId is the unique ID used to identify a single request as it crosses inter-process boundaries and intra-process boundaries. This ID helps link spans in a distributed trace. 
  • The transactionId is the unique identifier for the transaction event.
  • The priority is a randomly generated priority ranking value that helps determine which data is sampled when sampling limits are reached. 
  • The sampled boolean value tells the agent whether trace data should be collected for the request. Transactions sampled for a full trace are given a true value for the sampled attribute, which propagates downstream to signal every other APM agent the trace touches to collect spans. These downstream spans are also given a true value for the sampled attribute.
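
The fields above can be sketched as a payload that one service attaches to an outbound request and the next service reads. The key names, the JSON encoding, and the helper functions are assumptions for illustration; they do not reproduce New Relic's actual header format.

```python
import json
import random
import secrets
import time


def new_id(num_bytes=8):
    """Random hex identifier (hypothetical ID scheme)."""
    return secrets.token_hex(num_bytes)


def outbound_payload(trace_id, transaction_id, last_span_guid,
                     parent_type, priority, sampled):
    """Build the context a service would attach to an outgoing request."""
    return {
        "guid": last_span_guid,                # last span in the calling process
        "parentType": parent_type,             # e.g. "mobile", "browser", "app"
        "timestamp": int(time.time() * 1000),  # UNIX time in milliseconds
        "traceId": trace_id,
        "transactionId": transaction_id,
        "priority": priority,
        "sampled": sampled,
    }


def receive(payload):
    """Receiving side: link the first span and the transaction to the caller."""
    span = {
        "guid": new_id(),
        "parentId": payload["guid"],  # caller's last span becomes the parent
        "traceId": payload["traceId"],
        "sampled": payload["sampled"],
    }
    transaction = {
        "parent.type": payload["parentType"],
        "traceId": payload["traceId"],
        "priority": payload["priority"],
        "sampled": payload["sampled"],
    }
    return span, transaction


payload = outbound_payload(
    trace_id=new_id(16),
    transaction_id=new_id(),
    last_span_guid=new_id(),
    parent_type="browser",
    priority=round(random.random(), 6),
    sampled=True,
)
header_value = json.dumps(payload)  # what would travel in an HTTP header
span, transaction = receive(json.loads(header_value))
print(span["parentId"] == payload["guid"])
```

Because the trace ID and the sampled flag are copied unchanged at every hop, spans collected by different services can later be joined into one trace.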

Using the W3C Trace Context standard in a distributed environment

What if you’re using multiple tools in your environment? When trace context isn’t standardized, your traces can’t be correlated or propagated when they cross boundaries between different tools from different vendors. If you’re using a distributed environment with multiple middleware services and cloud platforms, this problem is critical. The W3C Trace Context standard defines a “universally agreed-upon format for the exchange of trace context propagation data.” 

The standard addresses interoperability issues by providing:

  • a unique identifier for individual traces and requests.
  • an agreed-upon mechanism to forward vendor-specific trace data and avoid broken traces when multiple tracing tools participate in a single transaction.
  • an industry standard that intermediaries, platforms, and hardware providers can support.

To adhere to this standard, tracing tools must propagate the traceparent and tracestate headers so that traces aren’t broken. New Relic agents that support W3C Trace Context send and receive these two required headers. They also send and receive the proprietary header used by earlier New Relic agents. The trace context headers supported by New Relic include:

  • W3C traceparent identifies the entire trace (trace ID) and the calling service (span ID). The traceparent header describes the position of the incoming request in its trace graph in a portable, fixed-length format. Every tracing tool must properly set traceparent even when it only relies on vendor-specific information in tracestate.
  • W3C tracestate carries vendor-specific information and tracks where a trace has been. The tracestate header extends traceparent with vendor-specific data represented by a set of name/value pairs. Storing information in tracestate is optional.
  • The New Relic proprietary header is the original, proprietary header that’s used to maintain backward compatibility with prior New Relic agents.
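
As a rough illustration of the two W3C headers, the Python sketch below parses a traceparent value, builds the header for an outbound call, and splits a tracestate value into its vendor entries. It only accepts the four-field layout of version 00 (version, trace ID, parent span ID, flags) and skips most of the validation rules in the specification. The function names and the vendor keys in the tracestate example are made up for this example.

```python
import re
import secrets

# version "00": 2 hex chars, 32-hex trace ID, 16-hex parent ID, 2-hex flags
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(value):
    """Return the parsed fields, or None if the header is not usable."""
    match = TRACEPARENT_RE.match(value.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Simplification: only version 00 is handled; all-zero IDs are invalid.
    if version != "00" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }


def child_traceparent(parent):
    """Header for an outbound request: same trace ID, new span ID."""
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{flags}"


def parse_tracestate(value):
    """Split tracestate into ordered (vendor key, value) pairs."""
    pairs = []
    for member in value.split(","):
        key, sep, val = member.strip().partition("=")
        if sep:
            pairs.append((key, val))
    return pairs


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(incoming)
outgoing = child_traceparent(parsed)
vendors = parse_tracestate("vendor1=abc, vendor2=xyz")
print(parsed, outgoing, vendors)
```

The trace ID stays constant across every hop, while each service substitutes its own span ID before forwarding the header.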

Here’s an example scenario from How trace context is passed between applications in the New Relic docs. It shows the flow when a request touches an OpenTelemetry tracer, a New Relic agent that uses the W3C Trace Context standard, and an older New Relic agent that predates the W3C Trace Context standard.

[Diagram: the flow of headers when a request touches three different agent types]

Trace sampling: Head-based and tail-based

Trace sampling is a technique used in distributed tracing to reduce the amount of trace data that is collected and stored. Sampling the trace data reduces the overhead associated with distributed tracing and provides a representative sample of the system’s performance. There are two trace sampling methods: head-based and tail-based.

Head-based sampling

Head-based sampling randomly selects traces for collection and storage at the beginning, or “head,” of the trace. It is used to capture a representative sample of activity while avoiding storage and performance issues. The trace origin (the first service monitored in a distributed trace) chooses requests at random to be traced, and this decision propagates to the downstream services touched by that request, so all the spans in a sampled trace are available in the tracing tool.
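
A minimal Python sketch of this idea follows, assuming a fixed sample rate and a simple chain of services; the function and service names are hypothetical. The important point is that the random choice is made once, at the origin, and every downstream service honors it rather than deciding again.

```python
import random


def handle(service, sampled):
    """A service records its span only if the propagated decision says so."""
    return [f"{service}: span"] if sampled else []


def trace_request(services, sample_rate, rng):
    """Decide at the trace origin, then propagate the decision downstream."""
    sampled = rng.random() < sample_rate  # made once, before any work happens
    spans = []
    for service in services:
        spans.extend(handle(service, sampled))
    return sampled, spans


rng = random.Random(42)
services = ["frontend", "cart-service", "inventory-service"]
for _ in range(5):
    sampled, spans = trace_request(services, sample_rate=0.5, rng=rng)
    # A trace is either collected in full or not at all.
    print(sampled, len(spans))
```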

Head-based sampling also includes adaptive sampling, a technique in which APM agents adapt how many transactions they collect based on changes in transaction throughput. If the limit is 10 traces per minute, the agent spreads the collection of those 10 traces over the minute to get a representative sample over time. The sampling rate responds to changes in throughput: if the previous minute had 100 transactions, the agent anticipates a similar number in the current minute and selects 1 out of every 10 transactions to be traced.
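
The arithmetic in that example can be sketched as follows. This is not New Relic's agent algorithm; the AdaptiveSampler class and its method names are assumptions, and the first minute (when no history exists yet) is handled by simply sampling until the limit is reached.

```python
import random


class AdaptiveSampler:
    """Toy adaptive sampler: target N traces per minute, rate from last minute."""

    def __init__(self, target_per_minute=10):
        self.target = target_per_minute
        self.seen_last_minute = 0
        self.seen_this_minute = 0
        self.sampled_this_minute = 0

    def ratio(self):
        """Fraction of transactions to sample, based on last minute's throughput."""
        if self.seen_last_minute <= self.target:
            return 1.0
        return self.target / self.seen_last_minute

    def should_sample(self, rng):
        self.seen_this_minute += 1
        if self.sampled_this_minute >= self.target:
            return False  # limit for this minute already reached
        if rng.random() < self.ratio():
            self.sampled_this_minute += 1
            return True
        return False

    def roll_minute(self):
        """Call at each minute boundary to adapt to the observed throughput."""
        self.seen_last_minute = self.seen_this_minute
        self.seen_this_minute = 0
        self.sampled_this_minute = 0


rng = random.Random(0)
sampler = AdaptiveSampler(target_per_minute=10)
first_minute = sum(sampler.should_sample(rng) for _ in range(100))
sampler.roll_minute()
second_minute_ratio = sampler.ratio()  # 10 / 100, i.e. 1 in 10
second_minute = sum(sampler.should_sample(rng) for _ in range(100))
print(first_minute, second_minute_ratio, second_minute)
```

After a minute with 100 transactions and a limit of 10, the ratio becomes 1 in 10, which matches the example in the text.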

Tail-based sampling

Unlike head-based sampling, tail-based sampling makes trace retention decisions after all the spans in a trace have arrived, at the tail end.
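
Here is a minimal Python sketch of a tail-based sampler, under several assumptions: spans are buffered in memory, the caller signals when a trace is complete, and a trace is kept if it contains an error or exceeds a latency threshold. The TailSampler class, the 500 ms threshold, and the span fields are illustrative choices, not a description of any specific product.

```python
from collections import defaultdict


class TailSampler:
    """Buffer every span, then decide per trace once it has finished."""

    def __init__(self, slow_ms=500):
        self.slow_ms = slow_ms
        self.buffer = defaultdict(list)  # trace_id -> spans seen so far

    def add_span(self, span):
        self.buffer[span["trace_id"]].append(span)

    def finish_trace(self, trace_id):
        """Return the spans to keep, or an empty list to drop the trace."""
        spans = self.buffer.pop(trace_id, [])
        if not spans:
            return []
        start = min(s["start_ms"] for s in spans)
        end = max(s["start_ms"] + s["duration_ms"] for s in spans)
        has_error = any(s.get("error", False) for s in spans)
        is_slow = (end - start) >= self.slow_ms
        return spans if (has_error or is_slow) else []


sampler = TailSampler(slow_ms=500)
sampler.add_span({"trace_id": "t1", "start_ms": 0, "duration_ms": 80})
sampler.add_span({"trace_id": "t2", "start_ms": 0, "duration_ms": 60, "error": True})
sampler.add_span({"trace_id": "t3", "start_ms": 0, "duration_ms": 100})
sampler.add_span({"trace_id": "t3", "start_ms": 50, "duration_ms": 700})

kept_fast = sampler.finish_trace("t1")   # fast and healthy: dropped
kept_error = sampler.finish_trace("t2")  # contains an error: kept
kept_slow = sampler.finish_trace("t3")   # 750 ms end to end: kept
print(len(kept_fast), len(kept_error), len(kept_slow))
```

Every span has to be transmitted and held until the decision is made, which is where the extra infrastructure and data costs of tail-based sampling come from.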

Pros and cons of head-based vs tail-based sampling

Head-based sampling pros:

  • Works well for applications with lower transaction throughput
  • Fast and simple to get up and running
  • Appropriate for blended monolith and microservice environments where monoliths still reign supreme
  • Little-to-no impact on application performance
  • A low-cost solution for sending tracing data to third-party vendors
  • Statistical sampling provides adequate transparency into the distributed system

Tail-based sampling pros:

  • Observes and analyzes 100% of traces
  • Samples traces after they are fully completed
  • Visualizes traces with errors or uncharacteristic slowness more quickly

Head-based sampling cons:

  • Traces are sampled randomly
  • Sampling happens before a trace has fully completed its path through many services, so there is no way to know upfront which trace may encounter an issue
  • In high-throughput systems, traces with errors or unusual latency might be sampled out and missed

Tail-based sampling cons:

  • May require additional gateways, proxies, and satellites to run sampling software
  • Requires some toil to manage and scale third-party software in some cases
  • Incurs additional costs for transmitting and storing more data

Benefits of distributed tracing

According to the 2022 State of Observability Forecast, 36% of engineers already use distributed tracing, and, of the remainder, 90% plan to deploy it by 2025. But what makes distributed tracing so helpful, and why should you adopt it? Here are three simple reasons:

  • It provides complete end-to-end visibility. Distributed tracing provides a detailed view of the full user journey, from frontend to backend, of individual requests that are processed across multiple components. This makes it easier to understand the impact of changes or find performance problems that are part of a larger system before they become critical.
  • It allows you to decrease mean time to resolution (MTTR) and mean time to detection (MTTD). By monitoring the performance of each component in a distributed system, developers can better identify areas for improvement and optimize the system for improved performance. According to The Business Value of the New Relic Observability Platform IDC paper, troubleshooting teams required significantly less time with New Relic in place to identify, manage, and resolve application-related issues. They identified issues 83% faster and resolved identified issues 27% faster.
  • It can improve team collaboration. According to the same IDC paper, after customers adopted New Relic, troubleshooting teams benefited from improved code-level visibility and metrics when finding and resolving issues that affect development workflows and deployments. Average team productivity increased by 43%, which translated into an average annual savings of $1.3M for each organization surveyed.

In our case, distributed tracing is also easy to set up. You deploy one agent, and New Relic APM instruments each service involved in a request, creates timings for operations within the service, and automatically adds troubleshooting information to each span. You can then add custom attributes to transactions and see all of your information in the trace.