Diving into Distributed Tracing

Greg Mefford
Bleacher Report Engineering
May 28, 2019


An example of a distributed trace across several back-end services

At Bleacher Report, we have gotten a lot of value out of splitting our monolithic Rails codebase into several smaller services, mostly written in Elixir. Doing so let us significantly reduce the number of server instances we needed to absorb our large traffic spikes. Today, we operate a couple of dozen services written in several languages, including C#, Elixir, NodeJS, and Ruby. We choose the technology for each new service based on a number of technical and organizational factors, so our tools need to be compatible across our whole infrastructure.

One of the big challenges that came with this transition to a multi-service architecture was observability. As we added more services, it became increasingly difficult to assemble the right set of vendors, libraries, and integrations to get a cohesive picture of what was going on. When we hit high latency, high error rates, or time-outs during a big traffic spike, it was hard to diagnose where the problem actually was. We often had to guess based on gut feeling and past experience, then hypothesize and verify with metrics and logs.

Distributed Tracing

For a good, in-depth intro to what distributed tracing is and how it works, check out Lynn Root’s blog post about it and the accompanying conference talk. I also recently gave a talk called “Continuous Tracing in Production” at Code BEAM STO 2019, so check out the slides and keep an eye out for the video once it is posted.

The fundamental concepts of distributed tracing are traces, spans, and context propagation. Each span in a trace has a name, a range of time during which that operation happened, and some other metadata that can be associated with it. Spans know what their parent span is and what trace (i.e., related group of spans) they belong to. Context propagation is the process by which one service can tell another service how to connect its spans into the overall trace.
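
To make those concepts concrete, here is a minimal Elixir sketch of a trace containing nested spans, written against the kind of tracer macros provided by the Spandex library we describe later in this post. The MyApp.Tracer module and the span names are hypothetical, and this is only a sketch of the idea, not code from our services.

    # A hypothetical Spandex-style tracer, defined once per application.
    defmodule MyApp.Tracer do
      use Spandex.Tracer, otp_app: :my_app
    end

    defmodule MyApp.Checkout do
      # trace/2 and span/2 are macros, so the tracer must be required.
      require MyApp.Tracer, as: Tracer

      def place_order(order) do
        # Starts a new trace whose root span is named "checkout.place_order".
        Tracer.trace("checkout.place_order") do
          # A child span for the database work; it records its own timing but
          # knows its parent span and which trace it belongs to.
          items =
            Tracer.span("db.load_items") do
              load_items(order)
            end

          # Another child span, for the payment call.
          Tracer.span("payment.charge") do
            charge(order, items)
          end
        end
      end

      defp load_items(_order), do: []
      defp charge(_order, _items), do: :ok
    end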

Tracing and context propagation overview

Context propagation is required so that downstream services can contribute their spans to the overall trace, without having to send that data back to the upstream services. Instead, they send their spans directly to the trace collector, using the same trace ID as the upstream service, so that the collector can connect them all together.

Each node sends its spans directly to the trace collector
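
As a rough sketch of what that propagation looks like on the wire, an upstream service can copy its current trace context into outbound request headers, using the Datadog header conventions we adopt later in this post. The module and function below are illustrative, and this assumes Spandex's current_context/0 returning a Spandex.SpanContext with trace_id and parent_id fields.

    defmodule MyApp.TraceHeaders do
      # Returns headers that a downstream service can use to attach its spans
      # to the caller's trace. If there is no active trace, returns no headers.
      def current do
        case MyApp.Tracer.current_context() do
          {:ok, %Spandex.SpanContext{trace_id: trace_id, parent_id: span_id}} ->
            [
              {"x-datadog-trace-id", to_string(trace_id)},
              {"x-datadog-parent-id", to_string(span_id)}
            ]

          {:error, _no_active_trace} ->
            []
        end
      end
    end

    # Usage: merge these headers into any outbound HTTP call, for example
    #   HTTPoison.get(url, MyApp.TraceHeaders.current() ++ other_headers)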

Getting Started

With basic logging and metrics in place, filling in the distributed tracing gap was difficult to prioritize over other things. In mid-2018, we were able to allocate some time to really focus on improving our observability and monitoring across services. Our initial research led us to OpenCensus as a great option that would allow us to aggregate metrics and traces by instrumenting our code with a unified set of vendor-neutral libraries and protocols. At that time, the opencensus-beam project didn’t offer any simple drop-in integrations for our Elixir-based services, and we weren’t prepared to invest until it was clear that there was value for us in distributed tracing as a technology. We also wanted to avoid operating our own trace-collection infrastructure.

We evaluated several vendor-specific solutions, but in the end we decided that Datadog APM fit best with the Datadog-based metrics and alerting infrastructure we already had in place. Datadog APM supports OpenTracing, a standard similar to OpenCensus. When we started the evaluation, Datadog already had first-party support for C#, NodeJS, and Ruby (among others). There was no first-party support for Erlang or Elixir, but based on feedback from both Datadog and the Elixir community, we chose the Spandex library to integrate with our Elixir-based services.

Instrumenting the Code

We rolled out our Datadog APM instrumentation to our various services using a staged approach:

  1. Pick a few interconnected services
  2. Add some easy tracing integrations with popular libraries
  3. Get them running in production
  4. Verify that complete distributed traces are showing up
  5. Iteratively add more services and more manual instrumentation

Regardless of which language a given service uses, the second step was mostly a matter of configuration to hook into the instrumentation APIs provided by the libraries we already depend on. For example, for Elixir-based services, it’s pretty easy to integrate with spandex_phoenix and spandex_ecto (there’s even a spandex_example project whose README links to GitHub diffs showing exactly what to do). For Ruby-based services, there are quite a few supported libraries that can be instrumented by configuring a few options. In the image at the top of the article, for example, we can see a distributed trace originating from a user-facing service at the top and passing through HTTP calls to various downstream services. The details of each span are not shown in that image, but nearly all of them were generated by one of these “easy integrations.”
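
For an Elixir service, that configuration step looks roughly like the sketch below: define a Spandex tracer, point it at the Datadog adapter, and tell spandex_phoenix and spandex_ecto which tracer to report through. The module names and option values here are illustrative rather than copied from our codebase, so treat each library's README as the source of truth.

    # config/config.exs (illustrative values)
    import Config

    config :my_app, MyApp.Tracer,
      service: :my_app,
      adapter: SpandexDatadog.Adapter,
      disabled?: false,
      env: "production"

    # Point the drop-in integrations at our tracer.
    config :spandex_phoenix, tracer: MyApp.Tracer

    # spandex_ecto also needs to be hooked into Ecto's logging/telemetry as
    # described in its README.
    config :spandex_ecto, SpandexEcto.EctoLogger,
      service: :my_app_ecto,
      tracer: MyApp.Tracer

    # lib/my_app/tracer.ex -- the application's Spandex tracer.
    defmodule MyApp.Tracer do
      use Spandex.Tracer, otp_app: :my_app
    end

    # lib/my_app_web/endpoint.ex -- `use SpandexPhoenix` wraps each request in
    # a trace that spandex_ecto spans and any manual spans then attach to.
    defmodule MyAppWeb.Endpoint do
      use Phoenix.Endpoint, otp_app: :my_app
      use SpandexPhoenix

      plug MyAppWeb.Router
    end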

The fifth step is also straightforward, but it comes with some initial challenges in deciding what to trace, which conventions to use, and how to deal with the idiosyncrasies of each runtime. For example, the Datadog-provided libraries for Ruby use a lot of meta-programming magic to add instrumentation to various libraries without those libraries needing to know anything about tracing. Because Elixir is more explicit about where code is defined, we have to use macros or function decorators to mark exactly where we want manual tracing in the codebase.
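
A hand-instrumented module might look something like the sketch below, using Spandex's span macro and its optional function decorators. The module and span names are illustrative, and the decorator support assumes the optional decorator dependency and configuration described in Spandex's documentation.

    defmodule MyApp.Feeds do
      require MyApp.Tracer, as: Tracer

      # Spandex's optional function decorators (built on the `decorator`
      # package); assumes `config :spandex, :decorators, tracer: MyApp.Tracer`.
      use Spandex.Decorators

      # Explicitly wrap the section we want to see as its own span in the trace.
      def rebuild(feed_id) do
        Tracer.span("feeds.rebuild") do
          do_rebuild(feed_id)
        end
      end

      # Decorators are an alternative to sprinkling span macros through function
      # bodies; this traces the whole function as a single span.
      @decorate span()
      def refresh_cache(feed_id) do
        do_rebuild(feed_id)
      end

      defp do_rebuild(_feed_id), do: :ok
    end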

One specific issue we struggled with in Elixir is that Spandex uses the Process Dictionary to maintain the state of the trace in a process. That means that when doing work asynchronously, we needed to pass the trace context from the parent process to the child process, so that the child could continue adding spans to the same trace, in the same way that a downstream HTTP call needs to propagate the trace state using HTTP headers. For example, we use code like the following to transparently call Tracer.async/1 when we would normally call Elixir’s Task.async/1 function.

Example code to pass Spandex’s trace context through Elixir’s Task.async
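
In outline, the wrapper looks something like this. It is a simplified sketch rather than our exact code, and it assumes Spandex's current_context/0, continue_trace/2, and finish_trace/0; the module and span names are illustrative.

    defmodule MyApp.TracedTask do
      alias MyApp.Tracer

      # Drop-in replacement for Task.async/1 that carries the caller's trace
      # context into the new process, since Spandex keeps trace state in the
      # process dictionary of the process that started the trace.
      def async(fun) when is_function(fun, 0) do
        case Tracer.current_context() do
          {:ok, span_context} ->
            Task.async(fn ->
              # Continue the caller's trace in the child process...
              Tracer.continue_trace("async_task", span_context)

              try do
                fun.()
              after
                # ...and flush the child's spans when the work is done.
                Tracer.finish_trace()
              end
            end)

          {:error, _no_active_trace} ->
            # No active trace in the caller; behave exactly like Task.async/1.
            Task.async(fun)
        end
      end
    end

    # Usage: MyApp.TracedTask.async(fn -> expensive_work() end) |> Task.await()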

Low-Hanging Fruit

Adding this new dimension of observability to our services immediately surfaced things that weren’t obvious before. For example, when a request is slow, we can now get a much better understanding of whether it’s because of downstream HTTP calls, inefficient database queries, or simply heavy in-memory processing. Traces let us see into a sampling of the individual requests that make up our aggregated time-series metrics, which were all we had in the past. Seeing the trace visualized makes it much easier to say, “wait…why is it doing that?” and then go to the code and make an improvement. It also lets us apply Amdahl’s Law more easily when making performance improvements, investing our effort where it will matter most in practice.

Another clear advantage of distributed tracing is that it allows us to see our errors and stack-traces in the broader context of how we got into a bad state. For example, we can see in the following trace that there is an error being raised in one of the downstream service calls (denoted by the red outline around one of the blue spans), but it is being handled and not causing the higher-level service to also fail.

If we click on that error span, we can see the error message itself, along with the stack-trace showing where in that codebase the error happened. Since we can set metadata on spans throughout the trace, reproducing the error becomes much easier, because we can see the URL that led to it. We might also see the relevant query or body parameters, if we chose to include them as metadata on the spans.
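
As an illustration, a small helper along the following lines can attach that metadata and record errors on the current span. The module and function names are our own, and the option names follow Spandex's span options as we understand them, so check the Spandex docs for the authoritative list.

    defmodule MyApp.TraceMetadata do
      alias MyApp.Tracer

      # Attach request details to the current span so that a trace containing
      # an error also tells us which URL and query string triggered it.
      def annotate(%Plug.Conn{} = conn) do
        Tracer.update_span(
          http: [url: conn.request_path, method: conn.method],
          tags: [query: conn.query_string]
        )

        conn
      end

      # Record an exception and its stack-trace on the current span; the span
      # then shows up with an error marker in the trace view.
      def record_error(exception, stacktrace) do
        Tracer.span_error(exception, stacktrace)
      end
    end

    # Usage: call MyApp.TraceMetadata.annotate(conn) early in the request, for
    # example from a plug, and record_error/2 from a rescue clause.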

Culture Change

Perhaps the most important outcome of our deployment of distributed tracing has been the shift in how we respond to alerts and troubleshoot bugs. Before, we had to bring up a dashboard full of metrics and theorize about why one or another of them had a dip or a spike. We could look for errors in the logs, but more often than not there was no error associated with a slow request or an unexpected response. Now, we can share a link to a particular trace and discuss what’s happening and why. We can look for traces of slow requests, or for traces that resulted in an error. All of our teams, from development to operations and QA, use the same tools to monitor performance in production, test new code in staging, and investigate errors.

Trace Sampling

An important consideration when using distributed tracing at scale is how to sample the traces that get sent to the collector. As of this writing, Spandex does not have full support for trace sampling, mainly because Datadog APM uses the traces to measure aggregate service metrics, which would be counted incorrectly if priority sampling were combined with distributed traces. This lack of sampling support can contribute significantly to performance issues under heavy load. There are options to control the number of traces sent to Datadog APM per batch and the number of concurrent batches to send before applying synchronous back-pressure, but every trace is ultimately sent over HTTP to the Datadog Agent, regardless of sampling priority. One technique we’ve used to mitigate this is to override the default HTTPoison client that Spandex uses to send traces, specifying a dedicated hackney pool. That lets us control the number of connections used for tracing, so the rest of the system isn’t bottlenecked when it makes HTTP requests using the default pool.
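
In outline, that mitigation looks something like the sketch below: a dedicated hackney pool plus a thin HTTPoison wrapper handed to SpandexDatadog.ApiServer through its http option. Module names, pool sizes, and batching values are illustrative, and this assumes the ApiServer calls the configured module's put/3 the same way it calls the default HTTPoison client.

    defmodule MyApp.TracingHTTP do
      # Minimal HTTPoison-compatible client that pins requests to a dedicated
      # hackney pool used only for shipping traces.
      def put(url, body, headers) do
        HTTPoison.put(url, body, headers, hackney: [pool: :tracing_pool])
      end
    end

    defmodule MyApp.Application do
      use Application

      def start(_type, _args) do
        children = [
          # Start the dedicated connection pool for trace shipping.
          :hackney_pool.child_spec(:tracing_pool, timeout: 15_000, max_connections: 10),

          # Send traces to the local Datadog Agent through our custom client,
          # batching traces and applying back-pressure past the sync threshold.
          {SpandexDatadog.ApiServer,
           [
             host: "localhost",
             port: 8126,
             http: MyApp.TracingHTTP,
             batch_size: 10,
             sync_threshold: 100
           ]}
        ]

        Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
      end
    end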

The Future: OpenTelemetry

The worst thing about OpenCensus and OpenTracing has always been that they both exist: they are very similar but incompatible. Those days are finally coming to an end, because the OpenCensus and OpenTracing projects are in the process of merging into a combined project called OpenTelemetry. The first release of OpenTelemetry is intended to be treated as the next major release of both OpenCensus and OpenTracing. The idea is to make the transition as smooth as possible for clients of either existing project and to sunset both by the end of 2019. This is great news for each of the projects and for the distributed tracing community as a whole, because it means more people working on, and using, the same set of technologies. For example, the Spandex maintainers are already teaming up with the OpenCensus-BEAM maintainers on the transition toward standardized OpenTelemetry support for BEAM languages.

Conclusion

At Bleacher Report, we’ve seen a lot of benefit from investing in distributed tracing alongside our logs and metrics. Some of those benefits were what we expected; others were a welcome surprise. If you are interested in distributed tracing, we recommend getting started on the OpenCensus path and following along as the OpenTelemetry convergence merges these projects into a unified telemetry platform for distributed tracing as well as metrics and logs.
