Monitorama 2017 conference report

May 24, 2018

I wrote this report for internal consumption shortly after the conference. As Monitorama 2018 approaches, I thought these notes could be useful for a wider audience to revisit the common themes and interesting questions from the previous year and to see what has been done to address them.

I work at PagerDuty. The opinions and comments below are my own and do not reflect PagerDuty’s position. If you enjoyed these notes, one way to read them sooner is to work with me :)

Summary

This conference was recommended to me by multiple coworkers, and I was not disappointed. On-site, there were many interesting talks and a bunch of hallway track conversations, while the Monitorama Slack might be the most active one I have seen for a conference of this size. The organizers demonstrated that they walk the incident management talk by cleanly failing over to an alternate venue overnight as an electrical vault fire cut power to a large chunk of downtown Portland, including the original venue, for a full day starting the evening of the first day of the conference.

The first clear trend of Monitorama was socks. Every sponsor had a design to give away, and socks are an easier fit into a larger variety of wardrobes than T-shirts. I thought it was a one-time special but later learned that socks have been a Monitorama tradition since 2013.

The second clear trend was a frustration with overly complex code instrumentation and with the multiple, often poorly integrated, systems for capturing and analyzing specific types of monitoring data. This trend manifested in two ways: a push away from free-form text logging toward structured logging to simplify analysis and extraction of metrics, mentioned by Liles, Anctil, Majors, and Fisher; and a more ambitious argument that logs, metrics, and traces are essentially the same basic unit of monitoring, with the differences between them driven mainly by implementation constraints imposed by analysis systems for achieving reasonable response times. This view was expressed by Mukerjee and Majors.

Continuing the discussion around metrics, Bryan Liles made a case for a very simple change of reporting rich health checks instead of just a status code, and both Majors and Sigelman pointed out that metrics/time series data are just the beginning of an investigation: they can indicate a problem but rarely point directly at the root cause.

With the increase in both the number of companies on a microservices architecture and the number of microservices in each, the need for an infrastructure-wide tracing solution was clear, both to debug issues and merely to understand the constantly changing interconnections between services. There were multiple vendors offering solutions. Shkuro told the story of Uber’s tracing system in detail and Sigelman extended the basic tracing model with resource tags to quickly find causes of resource starvation.

Two speakers talked about a need for a clearer separation between monitoring and alerting; both at the level of different systems and models (Nichols) and people maintaining, building, and improving the systems (Rapoport).

There was a good number of talks about both the monitoring and the general infrastructure of various companies, which is always helpful for collecting opinions on possible approaches, gauging popularity, and adopting good practices. I thought Fisher’s talk on a mostly serverless log pipeline at Lyft was the most radical architecture of the bunch–could serverless be the future of monitoring infrastructure?

Finally, the on-call community got a lot of love with talks on specific debugging tools from Evans and Creager and more general mental frameworks for debugging and major incident handling by Bennett and Reed. The on-call section was capped by Goldfuss’s powerful reminder that on-call is a necessary overhead, not a goal in itself, and we should be striving to minimize it instead of glorifying it.

Individual talks with closed captioning are now out: https://vimeo.com/channels/1255299.

Best Talks

In my opinion, these talks were the most thought-provoking of the bunch. If you do not have the time to read the entire report (and I do not blame you–it weighs in at close to 7,000 words), pay attention to these three.

Amy Nguyen’s “UX Design and Education for Effective Monitoring Tools”. An engineer, not a UX designer, Amy dove into a topic that was not covered by any other talk at the conference but is crucial to gaining acceptance and effective use of any tool, bought or custom developed, by the rest of the organization.
Yuri Shkuro’s “Distributed Tracing at Uber Scale”. In addition to describing the tools and the capabilities, Yuri’s talk offered a blueprint and a set of “carrots” and “sticks” to convince an organization to invest the effort into across-the-board distributed tracing.
The first part of Aditya Mukerjee’s talk on the convergence of observability primitives and a single system that should transform them as appropriate. Then, follow with a more forceful exploration of the same concept in Charity Majors’s “Monitoring: A Post Mortem”, which is a thought experiment on what an ideal observability system could look like if the existing constraints on the way monitoring solutions are built were removed.

And yes, I cheated a bit and there are four must-see talks in these three bullet items.

Talk Notes

These are in schedule order.

The Tidyverse and the Future of the Monitoring Toolchain—John Rauser

The data science field has, over the years, developed a rather impressive and highly coherent toolset for slicing, dicing, and visualizing large amounts of data. Coincidentally, this is exactly what monitoring folks like doing when not fighting fires–namely, looking for trends and insights in the data collected by their monitoring systems.

The talk claimed that the toolset of data science is superior to what is currently available in the monitoring field (especially not in the open-source world) and that, as more and more datasets cross the boundary from “merely large” into “data science scale”, languages and tools explicitly designed for analyzing these datasets would supplant existing tools. For example, John expects dplyr to replace SQL in the analytics world.

John demonstrated some of the components of the data science toolchain: the R language and the common data interchange format–the data frame–which is a row/column table and turns out to be suitable for a very large class of problems. He also covered the tidyverse package collection and some of the packages included in it, like ggplot2 (visualization, aka nice charts) and dplyr (a data manipulation tool). Better, more graphical tools would help increase R’s reach; Shiny is one example, and John wanted to inspire tool makers to focus on the R ecosystem.

See also:

R for data science, a free O’Reilly book.
Programming as a way of thinking: “modern programming languages are qualitatively different”.
Programming Bottom-Up, Paul Graham on Lisp.

Martyrs on Film: Learning to hate the #oncallselfie—Alice Goldfuss

Alice created the #oncallselfie, and the idea has caught on. PagerDuty added support for oncallselfies in their app, and there is a fair number of people sharing the experiences of being on-call. The talk, while acknowledging the benefits of on-call, cautions against glorifying the on-call culture and argues for cutting down the interruptive load of incidents by ruthless winnowing of events that are allowed to page on-call engineers and adding automation that removes humans from the loop.

Putting developers on-call has clear benefits. Developers are usually best equipped to solve the problems–both immediate ones and root causes–and time to recovery can improve by a lot: for example, Sean Kilgore, a Principal Engineer at SendGrid, wrote in Monitorama Slack that “our MTTR improved by about 50% in the 3 months after we put our devs on call” in an organization of about 100 engineers.

It can also be exciting to be on call and “being the parent of these little server babies”; in many organizations, Operations has a military culture that glorifies the hardship and the action of being in the middle of an incident. However, in the movies, while action is happening the plot is stopped. Plot development happens in between the action scenes. Similarly, working on-call pages prevents you from doing higher level engineering work.

“Pages stop the plot of your career, and sometimes your life.”

How bad can it be? In VictorOps’ “State of On Call” report, 17% of respondents got 100+ pages per week during their worst on-calls. There are other red flags too, like a team not keeping up (too few people owning too much); applying bandaids–changing thresholds, snoozing, waiting out services that are unstable; and no organizational visibility. If a team is struggling but the company does not know about it or does not care, the team will not get anywhere. If this is happening to ops in a non-DevOps company and the developers won’t go on call, Alice’s advice is to “exit (that team, department, company), because your career and life are more important”.

To minimize the on-call load, each alert should be actionable. In Alice’s definition, all four of the following conditions must hold:

Something has broken
Customers are affected
The on-call is the best person to fix it
This must be fixed immediately

The final point is automation. Humans cannot satisfy short SLAs, especially at night. They need to wake up, understand the problem, etc. If an SLA has many nines, it must use automation to make most availability/failover decisions; otherwise that SLA is a work of fiction.

Monitoring in the Enterprise—Bryan Liles

A few talks at this year’s Monitorama started with a statement along the lines of “big enterprise has unlimited money, they can just throw money at any problem; this talk is for the rest of us”. Bryan’s talk was interesting precisely because he comes from “big enterprise” (Capital One) and could talk about monitoring challenges at that scale and in a highly regulated environment. At Capital One, the monitoring team numbers about 30-40 hands out of 600 engineers.

“What tools do we run? All of them. Why not?”

Since money is typically not the constraint in the enterprise, it is possible to set up different tools and choose the “best of breed”. The selection of tools is iterative (“pick tools; monitor; bicker about tool choices; back to step 1”). Communication and shared tooling is particularly important: do the developers even have access to monitoring team’s tools?

Getting the most information out of services is a big factor as well. Two items Bryan singled out were structured logging and rich health checks: “it doesn’t cost too much to output richer status data in addition to a yes/no response code”.
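
As a rough illustration of the difference between a yes/no status and a rich health check, here is a minimal Python sketch. The function names, the dependency names (database, cache), the version string, and the field layout are all hypothetical and are not taken from Bryan’s talk.

```python
import json
import time

START_TIME = time.time()


def check_dependencies():
    """Hypothetical dependency probes; real checks would contact each service."""
    return {
        "database": {"ok": True, "latency_ms": 4.2},
        "cache": {"ok": False, "error": "connection refused"},
    }


def basic_health(deps):
    """Plain yes/no answer: only an HTTP-style status code."""
    return 200 if all(d["ok"] for d in deps.values()) else 503


def rich_health(deps, version="1.4.2"):
    """Same status code, plus a JSON body describing what is wrong."""
    status = basic_health(deps)
    body = {
        "status": "ok" if status == 200 else "degraded",
        "version": version,
        "uptime_seconds": round(time.time() - START_TIME, 1),
        "dependencies": deps,
    }
    return status, json.dumps(body, sort_keys=True)
```

A caller that only looks at the status code sees the same 200 or 503 as before, while a human or a monitoring system reading the body can tell which dependency failed without opening the logs.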

Max’s comments: I don’t have any notes related to the flip side of having too many tools–how to determine which tool is most useful for a particular problem, and whether data goes into all tools or if there are silos of data restricted to a particular tool, and how that knowledge is shared. While money might not be a constraint, cognitive overhead exists in organizations of all sizes, and I would have liked to hear more about conscious selection and deliberate decisions about reducing the core set of tools organizations choose to use.

Yo Dawg: Monitoring Monitoring Systems at Netflix—Roy Rapoport

Roy’s talk was about the monitoring setup at Netflix, with particular focus on how to monitor the health of the monitoring system itself. He made a distinction between monitoring (a system for storing and querying specialized types of data, such as time-series data) and alerting (a system that takes in data and outputs decisions and opinions).

Netflix stores full-resolution metrics for 6 hours, which are then S3+map/reduced into 3 day buckets and then further into 18-day buckets. The 18-day tier drops node information from the data since average system lifetime at Netflix is <3 days.

Usage of the system grew by 3 orders of magnitude within 3 years of launch. Tags need to be carefully managed; the typical grouping by sets of tags can result in huge cardinality of the tag space if dimensions with large numbers of values are used. Some examples: execution time in ms (30k tag values), account ID (10^9 tag values), IP address (2^32 tag values).
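
To see why these dimensions are dangerous, it helps to multiply them out. The sketch below computes a worst-case upper bound on the number of distinct time series for a single metric name; the tag names are hypothetical, while the value counts are the ones quoted above. Real systems rarely see every combination, so this is a ceiling and not a forecast.

```python
from math import prod

# Approximate number of distinct values per tag, using the examples above.
TAG_VALUES = {
    "execution_time_ms": 30_000,
    "account_id": 10**9,
    "ip_address": 2**32,
}


def series_upper_bound(tags, values_per_tag=TAG_VALUES):
    """Worst-case number of distinct time series for one metric name
    when it is tagged with every tag in `tags`."""
    return prod(values_per_tag[t] for t in tags)
```

Tagging one metric with all three dimensions gives an upper bound of about 1.3 × 10^23 series, which is why such dimensions are usually kept out of tags.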

The monitoring team (18 people, developers and operators) works closely with the top 10 internal Netflix applications by usage. Combined, they account for 50-60% of total metrics; the team does not worry about the rest of the applications since they do not have a systemic impact.

Our Many Monsters—Megan Anctil

Megan told the story of the Visibility Operations team at Slack. The team went with open-source solutions instead of as-a-service vendors and estimates that the savings are approximately $1m/month. The team has 5 full-time engineers.

For metrics, Slack uses a modified Graphite stack. 90 TB of data for 30 days. Custom interface to explore (show all metrics for a single host). No tags.

For logging, the setup is Logstash with Sieve, an in-house replacement for ElastAlert. 250 TB of data for 2 weeks, on 450 nodes. Parsing unstructured data (with regexes, typically) is very expensive; start with structured data.
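
The difference between the two approaches can be sketched in a few lines of Python. The log line, the field names, and the regular expression below are invented for illustration and do not come from Slack’s pipeline.

```python
import json
import re

# Free-form line: the consumer needs a regex that matches this exact wording.
TEXT_LINE = "2017-05-22 10:15:02 WARN request to /api/users took 1532 ms (status 500)"
TEXT_PATTERN = re.compile(
    r"^(?P<ts>\S+ \S+) (?P<level>\w+) request to (?P<path>\S+) "
    r"took (?P<duration_ms>\d+) ms \(status (?P<status>\d+)\)$"
)

# Structured line: the same event emitted as one JSON object per line.
JSON_LINE = json.dumps({
    "ts": "2017-05-22 10:15:02",
    "level": "WARN",
    "path": "/api/users",
    "duration_ms": 1532,
    "status": 500,
})


def parse_text(line):
    """Extract fields with a regex; returns None if the wording ever changes."""
    match = TEXT_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["duration_ms"] = int(fields["duration_ms"])
    fields["status"] = int(fields["status"])
    return fields


def parse_structured(line):
    """Fields and types come straight from the JSON document."""
    return json.loads(line)
```

Both functions return the same dictionary for these inputs, but the regex version needs one pattern per message format and silently returns nothing when a developer rewords the message.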

For alerting, Icinga with xinetd and nsca for active checks. 140k service checks on 6300 nodes.

Tracing Production Services at Stripe—Aditya Mukerjee

I understood this talk as containing two distinct parts. The first part talked about the future where observability primitives that are currently distinct, mostly for performance reasons of storage and querying (metrics, logs, and traces), are emitted together and intelligently processed by a single system into the most appropriate form for the application. Aditya calls the unified form the “Standard Sensor Format”.

A transition to this future would be driven both by pain of instrumenting all of those primitives separately in today’s service code (emitting metrics, logs, traces) and by the difficulty of investigating an issue based solely on one of the types of these primitives (for example, metrics can show that something is wrong, but good luck figuring out the cause from the metrics). Later in the day, Charity Majors put forth many of the same arguments, and I think it unlikely that putting these two talks echoing each other nearby in the schedule was a mere coincidence.

The second part of the talk was the discussion of Veneur, a replacement for DataDog’s collector daemon that can perform distributed aggregation of collected metrics before forwarding the data. The talk went into the details of how the aggregation is distributed across machines, and into the computation of alternatives to medians (which cannot be gathered across hosts in a statistically valid way) such as T-digests.
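
A small example shows why per-host medians cannot simply be merged. The host names and latency values below are made up; the sketch only demonstrates the problem and does not implement a T-digest or anything from Veneur.

```python
from statistics import mean, median

# Hypothetical request latencies (ms) observed on three hosts.
HOSTS = {
    "host-a": [10, 11, 12, 13, 14],
    "host-b": [10, 12, 200],
    "host-c": [300, 310, 320, 330, 340, 350, 360],
}

all_samples = [x for samples in HOSTS.values() for x in samples]

# Means compose: a count-weighted mean of per-host means equals the global mean.
per_host_mean = {h: (mean(s), len(s)) for h, s in HOSTS.items()}
combined_mean = sum(m * n for m, n in per_host_mean.values()) / len(all_samples)

# Medians do not: the median of per-host medians is generally not the global median.
median_of_medians = median(median(s) for s in HOSTS.values())
global_median = median(all_samples)
```

Here the median of the per-host medians is 12 ms while the true median over all 15 samples is 200 ms, so a mergeable sketch of the distribution has to be shipped from each host instead of a single number.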

Linux Debugging Tools You’ll Love—Julia Evans

Julia walked the audience through using several common Linux investigation and performance monitoring tools. Her approach to debugging, based on these tools, can be language-agnostic and does not depend on adding logs (“debug-by-printf”); instead, the OS-level tools can pinpoint the issue by looking at what the kernel is doing. Even without knowing the details of the program being debugged, this additional information can give a developer the nudge they need to arrive at the cause of the issue. Similarly, Julia shows that a deep knowledge of the kernel or operating systems internals is not a prerequisite to using these tools; often, the output of the tools is self-descriptive enough to make a good guess at what is happening.

Many programmers see debugging as a chore, a grunt work that is the necessary complement to the actually interesting part of writing new code. In a complete opposite of that world view, Julia’s enthusiasm for investigating hard problems and sharing the relevant knowledge is electrifying.

See also:

Comprehensive Linux performance tools charts by Brendan Gregg.
Julia’s Zines

Instrumenting Devices for Modeling the Internet—Guy Cirino

Netflix needs to see the full distribution of connection quality on the Internet to optimize their services. Sending too many metrics from the client can result in too much data to process; Guy does not want to take averages and does not want to miss data using sampling.

Instead, Netflix chose to aggregate statistics at the client and send out aggregated data once in a while. The preferred data structure for this is a histogram, but with non-linear buckets: when there are big outliers, linear buckets are wasteful on the long end.
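
One simple way to get non-linear buckets is to make each bucket twice as wide as the previous one. The sketch below uses power-of-two bucket boundaries, which is my assumption for illustration; the talk did not specify which bucket layout Netflix uses, and the class and function names are hypothetical.

```python
import math
from collections import Counter


def bucket_index(value_ms):
    """Map a latency to a power-of-two bucket: [1,2) -> 0, [2,4) -> 1, [4,8) -> 2, ...
    Values below 1 ms are folded into bucket 0."""
    if value_ms < 1:
        return 0
    return int(math.floor(math.log2(value_ms)))


def bucket_bounds(index):
    """Lower (inclusive) and upper (exclusive) bound of a bucket, in ms."""
    return 2 ** index, 2 ** (index + 1)


class ClientHistogram:
    """Aggregates samples on the client; only the bucket counts are sent."""

    def __init__(self):
        self.counts = Counter()

    def record(self, value_ms):
        self.counts[bucket_index(value_ms)] += 1

    def flush(self):
        """Return the compact payload and reset for the next reporting interval."""
        payload = dict(sorted(self.counts.items()))
        self.counts.clear()
        return payload
```

With this layout, 16 buckets cover everything from 1 ms to about 65 seconds, whereas linear 10 ms buckets would need several thousand entries to reach the same outliers.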

Monitoring: A Post Mortem—Charity Majors

The core argument of Charity’s talk was that the future of complex systems lies in observability, not monitoring. Namely, the current state-of-the-art monitoring tools focus on answering known questions about the state of the system; when configuring existing monitoring tools we think about what might go wrong and set up metrics, dashboards, and alerts to catch these problems. However, many difficult engineering problems start from “unknown unknowns”, issues that we did not anticipate–and therefore our chosen metrics and dashboards might not be helpful for diagnosing such scenarios, while tools that allow general observability and open-ended, deep exploration of system state do not currently exist.

With that argument in hand, Charity took existing monitoring systems to task for insufficient flexibility and destruction of detail. In particular, she blasted fixed pre-indexing (“anything that relies on schema or index that tries to guess what you’ll need a year from now is broken”), dashboards (“I can’t pregenerate a dashboard for every possible condition”), quality of data displayed on them (“A green dashboard is a lie, since any system is never completely up”), and aggregation in general (“aggregates destroy detail… outliers are the whole damn point”).

“If a user thinks your system is down, the more you argue with them, the more credibility you lose.”

What does the future look like? It starts with data. Charity envisions many structured, unaggregated events collected in response to every single user request. A system for investigating the data should be able to perform both statistical/data analysis and show specific extreme outliers, along arbitrary dimensions and not predefined ones (so, for example, metrics tagging would not be sufficient). The data should be collected from everywhere, regardless of ownership (software written by you, open source software, perhaps service vendors like cloud providers) to enable correlation analysis across arbitrary systems related to the application.

Max’s comments: some of the reactions to the talk I heard focused on the cost; a system like this “gets pretty expensive pretty fast” and could be much slower than existing specialized metrics/logs/traces monitoring systems that achieve speed by doing a lot of computation ahead of time along predetermined dimensions. However, I do think that monitoring should be regularly revisited as capabilities develop–for example, we can store way more data at the same cost (average cost per gigabyte of spinning rust was $1.24 in 2005 and $0.02 a decade later) but many infrastructure and OS services still sparingly log single, unstructured text lines.

One thing I did not agree with is that focusing on specific users (“it’s only the experience of a single user that matters”) or unique outliers is what engineers should be doing, for the simple reason of scale. The ratio of users to engineers at WhatsApp is 18,000,000:1. At this scale, it is practically impossible for engineers to look at unique, one-off problems; prioritization would ensure they only ever have time to look at issues that impact a significant percentage of their user base. This has been the case for many massive-scale services, like Facebook and GMail, whose support for problems experienced by individual users is famously difficult to obtain. However, this does not refute the thesis that to resolve a widespread issue one might have to look in minute detail at one individual case of the problem.

Charity also mentioned that operational skills are becoming expected, “table stakes” for software engineers; however, that needs to be recognized on the organizational level as well by, for example, not promoting to a senior position an engineer who does not know how to run their services.

The Vasa Redux—Pete Cheslock

The talk explored the story of the Vasa and similarities between the process of the Vasa’s construction and some practices of the software industry.

Anomalies != Alerts—Elizabeth Nichols

Elizabeth pointed out that, following the basic principles of probability, anomalies are actually very plentiful given a sufficient amount of data. For example, if a metric is sampled once per minute and there are 50,000 metrics on a normal distribution, one can reasonably expect roughly 5,000 four-sigma (standard deviation) events per day.
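
The estimate can be checked with a few lines of arithmetic. This sketch assumes independent, normally distributed samples and counts deviations in both directions; those assumptions are mine and were not spelled out in my notes from the talk.

```python
import math

METRICS = 50_000
SAMPLES_PER_DAY = 24 * 60  # one sample per metric per minute


def two_sided_tail(sigmas):
    """P(|Z| > sigmas) for a standard normal variable Z."""
    return math.erfc(sigmas / math.sqrt(2))


def expected_events_per_day(sigmas, metrics=METRICS, samples=SAMPLES_PER_DAY):
    """Expected number of samples per day that land beyond `sigmas`."""
    return metrics * samples * two_sided_tail(sigmas)
```

Under these assumptions, 72 million samples per day yield roughly 4,600 four-sigma events, the same order of magnitude as the figure from the talk and far too many to page a human for each one.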

We need more context to choose the outliers that actually matter; one way to provide context would be to build a “semantic model” of the system being monitored. It is becoming common to provide feedback from humans reacting to alerts regarding the quality of alerts, and that data can feed back into the model.

Distributed Tracing—Ben Sigelman

Ben claimed tracing is becoming more indispensable than ever, as the number of possible root causes for problems surfaced in metrics that describe user-visible symptoms grows with the number of cooperating distributed systems involved in the transaction. If monitoring is about telling stories, then a microservices architecture has many storytellers, but we still want to get to the “why” part as fast as possible.

Classic tracing shows where a transaction spends most of its time, but not why. Ben hypothesised that contention for a resource is the cause of most latency issues and proposed assigning a contention ID (CID) to each resource and adding all CIDs encountered during a transaction to the span data collected by tracing. Then, for a highly contested resource, it becomes possible to backtrace and aggregate to find out where the calls blocked on the resource originate.
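
A minimal sketch of the aggregation step might look like the following. The Span fields, the CID strings, and the function name are hypothetical; they illustrate the idea of tagging spans with contention IDs and are not the data model of any particular tracing system.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical span record; `contention_ids` is the extra tag set."""
    trace_id: str
    origin: str            # entry point that started the transaction
    operation: str
    wait_ms: float
    contention_ids: set = field(default_factory=set)


def origins_blocked_on(spans, cid):
    """Total wait per originating caller across spans that touched `cid`,
    sorted from the largest contributor to the smallest."""
    waits = defaultdict(float)
    for span in spans:
        if cid in span.contention_ids:
            waits[span.origin] += span.wait_ms
    return Counter(waits).most_common()
```

Given a contested resource such as a lock or a connection pool, the result ranks the callers that spend the most time waiting on it, which is the "why" that a plain latency breakdown does not show.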

Ben advocated for a standard tracing format (OpenTracing) and said that a lot of the tracing data might be discarded unnecessarily; out of the possible resources conserved by discarding or sampling the data (CPU, network, and storage) only storage has a significant cost. If other resources are not constrained, sampling “in process” becomes unnecessary.

Getting metrics at Firefox—Laura Thomson and Rebecca Weiss

Obtaining detailed usage data from an open-source project committed to protecting privacy is hard. Laura and Rebecca talked about Firefox’s path to faster and better methods of gathering data.

Firefox used to build a dedicated system to answer each specific question, which was very slow. Now, there is only one collection system that is self-service for users and prioritizes reliability and consistency over precision of the data.

One of the datasets collected is the Firefox hardware report, providing data on the capabilities of computers used by Firefox users.

Real-time Packet Analysis at Scale—Douglas Creager

Similar to Guy Cirino’s talk, Douglas also talked about obtaining network-related metrics from edge devices–in this case, for Google’s services. However, unlike Guy’s talk, this talk focused on specific methods of collection (tcpdump, libpcap) and interpretation (tracegraphs, latency/RTT, bufferbloat) of protocol-level metrics.

While collecting headers only does not result in an overwhelming amount of data (and helps with privacy and not looking at encrypted payloads), it is still not necessary to gather detailed data on all connections. Google biases sampling toward customers experiencing issues and captures a lot of detail for sampled connections. Deriving conclusions from the data is automated, and the results are sent into the monitoring stack.

Instrumenting the Rest of the Company—Eric Sigler

Eric’s unique perspective is derived from jumping between engineer and manager roles throughout his career. As an engineer, he is used to answering questions with data, but managers in meetings do not necessarily use data to justify business actions. “We have a problem with X, so we will do Y”, but why will Y help and how will success be measured?

“Without data, you are just a person with an opinion.”

“If we have data, let’s look at data. If all we have are opinions, let’s go with mine.”–Jim Barksdale, former CEO of Netscape (thanks to @marcus in Monitorama Slack)

What are some examples of metrics that can be used to answer business questions? For example, the time a CI system is waiting to start a job could indicate if there are enough build agents. The time a pull request remains open can point at velocity trends. JIRA reports (tickets closed) show the overall behavior of developers, but do not say why or where the bottlenecks are.

Eric closed his talk with a reminder of a difference between machines and humans: machines do not care which metrics are measured, but a business could choose a metric that incentivizes wrong behavior by employees.

Whispers in the Chaos—J. Paul Reed

This talk was very interesting to me, since I work at PagerDuty and incident management is our bread-and-butter. The talk presents some real-world heuristics and processes for engineers managing an incident, based on a thorough investigation of a real incident that happened at Etsy in December 2014. A significant part of the talk reviews the work done by John Allspaw for his MSc thesis.

According to Paul, engineers use the following heuristics:

Change. What has changed since the known-good state?
Go wide. When no relevant changes are found, widen the search for the problem to any potential contributors.
Convergent searching. Confirm or disqualify a theory by matching signals or symptoms to…
- a specific past diagnosis–a painful incident from the past
- a general and recent diagnosis–a recent incident (“Maybe X happened again”)

Expert engineers, based on their accumulated knowledge, can quickly:

determine typicality of the incident (is it a familiar problem? How bad is it?)
make fine discriminations (seeing fine detail others miss)
use mental simulation (what caused it and does it match what we see?)
apply higher level rules

How does one develop the expert knowledge?

Personal experience. Being challenged, “taking the pager” and doing on-call work.
Directed experience. Tutoring, training, code review, pairing, reading runbooks. Being able to tutor others, which requires deep understanding.
Training and simulation. Chaos engineering, game days, Failure Fridays.
Vicarious experiences. Especially bad or good events that become stories.

See also:

“Trade-offs under pressure”, John Allspaw’s MSc thesis
“The Field Guide to Understanding Human Error”, book by Sidney Dekker.
“Maslow’s service reliability hierarchy” diagram in the “Site Reliability Engineering” book
Etsy’s Debriefing Facilitation Guide

Monitoring @ CERN—Pedro Andrade

CERN does a lot of hard science which generates large amounts of data in realtime, so Pedro’s talk on the setup of capturing and processing this data was quite interesting.

CERN runs 14,000 servers in 2 data centers with tape backup. They use Flume (both push and pull) and aggregate all data through Kafka. File-based channels are always used, for persistence and not for capacity reasons (that’s what Kafka is used for).

Kafka details: the software has been solid. Data is buffered for 12 hours. Each data source is a topic, 20 partitions, 2 replicas/partition. Processing of data is via Spark (enrichment, aggregation, correlation - stream; compaction, reports - batch). All jobs are written in Scala, because in their experience this is a language well matched to Spark. Jobs are packaged as Docker containers and run in Marathon or Chronos over Mesos. Marathon worked well for them; Chronos was not as successful.

The long-term storage hierarchy flows from Kafka into Flume -> HDFS (forever), ESS (1 month), InfluxDB (7 days; binned and reduced for farther away past). Two ElasticSearch instances are used, separating metrics and logs; a separate index is created per data producer. In InfluxDB, one instance is allocated to every producer. Data is accessed using Zeppelin, Kibana, and Grafana.

Monitoring infrastructure is running in Openstack VMs, except for HDFS and InfluxDB which are on physical nodes. Puppet managed configuration.

The Future of On-Call—Bridget Kromhout

On-call experience will not be significantly improved merely by better deployment tooling; investing in overall system architecture, better observability, and developer/operations culture is a more rewarding path forward, claimed Bridget in her talk that concluded the second day of Monitorama.

Some specific points of culture raised were to build trust and informal links between teams so that they will openly share with each other what is really going on. Advice was given to the product (or project) managers as well, to not hurry so much to move on to the next thing just because the current project appears to work, for some values of “work”; and to avoid hype driven development, to think deeply about the problem at hand and whether the proposed solution actually would make it better (to me, this further reinforced the point raised by Eric Sigler about gathering business metrics and data to inform decisions).

See also:

Bridget recommended “The Art of Monitoring” by James Turnbull.

Lightning Talks

Lightning talks are brief 5-minute presentations. Here are the quick summaries.

Brant Strand: a case for professional ethics and protecting user data for our profession. “You have to fight for your privacy or you will lose it”.

Logan: Optimizing yourself for learning–how to improve memory? Knowing which questions to ask, practicing retrieval of previously learned information. Taking time to praise each other for working hard and solving difficult problems.

Christine Yen: Pre-aggregated metrics are bad at answering new questions. Metrics overload–sometimes there are too many to grasp/make sense of them all.

Eben Freeman: using Linux perf (probe, record, report) to investigate hard problems. “When measurement is hard, diagnoses are circumstantial, and weird stuff goes unexplained”–I liked this statement because it also goes the other way; if things in the infrastructure are diagnosed with uncertainty and weird stuff is dismissed as “just the way things are”, perhaps more monitoring is in order.

Joe Gilotti: Iris, an incident management tool developed in house.

Me: DNSmetrics, a tool to monitor DNS service providers.

Tommy Parnell: it was just after my talk so I did not take notes. Sorry!

Scott Nasello: on ChatOps, which are good for onboarding new team members since they can observe interaction and customers. Make people write a new ChatOps command on day one.

Monitoring in a World Where You Can’t “Fix” Most Of Your Errors—Brandon Burton

Hailing from TravisCI, Brandon was facing a particular monitoring challenge: many of the issues raised by TravisCI builds do not originate from Travis infrastructure; instead, they are GitHub / Python / RubyGems failures. Monitoring for these issues has to rely almost exclusively on log output produced by builders and searching that output for problems.

One question raised in the talk is whether making a stream of text (with potentially sensitive contents) indexed and searchable increases a risk to users, in addition to making monitoring easier. The architecture flows the log data from RabbitMQ to PostgreSQL to aggregation and archival; ElasticSearch is used for searching.

TravisCI runs a lot of non-building infrastructure on Heroku and claims they have fairly few long-lived, “owned” (I guess this means actively managed?) machines.

UX Design and Education for Effective Monitoring Tools—Amy Nguyen

Amy’s talk focused on users of monitoring systems, talking about the principles followed when building a custom metrics visualization tool at Pinterest to make it user-friendly but powerful enough for experts. Many of the tips and insights provided by Amy generalize easily to any internally built tool, whether brand new or evolving, and I thought this was one of the more insightful and broadly applicable talks this year.

No one wants to use monitoring tools. If they do, it means something is wrong.

A good monitoring tool is aimed at both experts and non-experts; not everyone is, or should be, an expert user at interpreting monitoring data. Amy suggested the following guidelines to make the system non-expert-friendly:

Not overloading users with too much information.
Best practices: using domain expertise to determine the most helpful default behavior. For example, making simplifying assumptions that almost always make sense, such as always spreading an alert over 10 minutes instead of immediately alerting when a threshold is crossed.
Making it hard to do the wrong thing.
Making it easy to try things without fear of breaking stuff. Exploration is important because we don’t know in advance what questions people might want to answer with the data.
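
The "spread an alert over 10 minutes" default from the list above could look roughly like the sketch below. This is only an illustration: the function name, the (timestamp, value) sample format, and the rule that every sample in the window must breach the threshold are my assumptions, not a description of the Pinterest tool.

```python
from typing import Iterable, Tuple


def should_alert(samples: Iterable[Tuple[float, float]],
                 threshold: float,
                 now: float,
                 window_seconds: float = 600.0) -> bool:
    """Fire only if the metric stayed above the threshold for the whole window.

    samples are (unix_timestamp, value) pairs. All names are hypothetical.
    """
    samples = list(samples)
    # Values observed inside the evaluation window.
    recent = [v for ts, v in samples if now - window_seconds <= ts <= now]
    # Do not fire until we have history reaching back to the window start.
    has_full_window = any(ts <= now - window_seconds for ts, _ in samples)
    return bool(recent) and has_full_window and all(v > threshold for v in recent)
```

A single spike above the threshold does not fire; ten minutes of sustained breach does.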

Some features are appreciated by all classes of users. Performance is always valued, because everyone wants to reach actionable conclusions faster; do whatever it takes to make tools fast. For monitoring, this means rolling up data over longer time ranges; caching data in memory, for example with Facebook Gorilla/Beringei, or adding a caching layer like Turn’s Splicer. Frontend UI tricks can be a quick way to add speed and can be as simple as not reloading existing data when bounds change, or lazy-loading pieces of content when they are about to come into view.
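
As a rough illustration of rolling up data for longer time ranges, the sketch below averages raw points into fixed-size buckets so that a wide graph has far fewer points to fetch and draw. The function name and the choice of averaging (rather than max or sum) are assumptions for the example.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def rollup(points: Iterable[Tuple[int, float]],
           bucket_seconds: int) -> List[Tuple[int, float]]:
    """Average (timestamp, value) points into buckets of bucket_seconds."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each point to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]
```

For example, a week of per-minute data rolled up into 1-hour buckets shrinks from about 10,000 points to 168 per series.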

Advanced/expert users would appreciate some accommodation as well. One way to do this is by switching to advanced UI mode if advanced stuff (like braces, or commands) is typed in. Max’s comment: An example of simple/expert UI can be found in Datadog, with both a graphical and a free-form query entry UI available with a quick switch on the Monitor screen.

Having a tool that looks good also, perhaps unconsciously, promotes trust in the system. Invest some effort into a polished user experience.

“I would say Grafana is aesthetically beautiful, and our tool is not, and that is so important. if the tools are ugly, people trust them less. It’s sad, but it’s true! if the margins are inconsistent or things look off, the trust is gone and people feel anxious.”

I asked a follow-up question about soliciting initial feedback from customers while building the tool:

“Good question! I would say release small, incremental changes over time. We didn’t release all of the things we built all at once–we slowly changed things we thought were painful, and then waited to see if people commented on it (and a lot of people sent messages saying “I like this new thing!” and that’s how we knew we were on the right track)… The benefit of being an in-house developer is that you can walk right up to your customers, show them your ideas on a piece of paper, and ask them what they think.”

A few numbers: Pinterest logs 100 TB/day at 2.5 million metrics per second. The engineering org numbers around 400. They considered Grafana plugins but decided on a custom tool.

Automating Dashboard Displays with ASAP—Kexin Rong

ASAP is an algorithm for automatic smoothing of noisy data. The main challenge is removing the noise, the short term fluctuations, while retaining “outlyingness”, or longer-term trends–the accurate term used to describe that is “kurtosis”.

The algorithm has been implemented in Macrobase, a system for diagnosing anomalies, and there is a Graphite function for it.

The talk is based on this paper by the presenter and Peter Bailis.
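
As I understand the paper, the core idea is to pick the moving-average window that minimizes roughness (the standard deviation of point-to-point differences) while keeping the kurtosis of the smoothed series at least as high as that of the original. The sketch below is a naive brute-force search under that reading; the function names are mine, and the optimizations the real algorithm uses to prune the search are not reproduced here.

```python
import statistics
from typing import List, Optional, Sequence


def moving_average(xs: Sequence[float], w: int) -> List[float]:
    """Simple moving average with window w."""
    return [sum(xs[i:i + w]) / w for i in range(len(xs) - w + 1)]


def roughness(xs: Sequence[float]) -> float:
    """Standard deviation of first differences; lower means smoother."""
    diffs = [b - a for a, b in zip(xs, xs[1:])]
    return statistics.pstdev(diffs)


def kurtosis(xs: Sequence[float]) -> float:
    """Fourth standardized moment (not excess kurtosis)."""
    mean = statistics.fmean(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    if var == 0:
        return 0.0
    return sum((x - mean) ** 4 for x in xs) / len(xs) / var ** 2


def asap_window(xs: Sequence[float], max_window: Optional[int] = None) -> int:
    """Brute-force search for the smoothest window that preserves kurtosis."""
    max_window = max_window or len(xs) // 10
    target = kurtosis(xs)
    best_w, best_rough = 1, roughness(xs)
    for w in range(2, max_window + 1):
        smoothed = moving_average(xs, w)
        if kurtosis(smoothed) >= target:
            r = roughness(smoothed)
            if r < best_rough:
                best_w, best_rough = w, r
    return best_w
```

A window of 1 means no smoothing was found that both reduces roughness and preserves the outliers.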

The End of User-Based Monitoring–François Conil

I did not take any notes for this talk, sorry!

Consistency in Monitoring with Microservices at Lyft–Yann Ramin

Yann described Lyft’s infrastructure and monitoring setup. Lyft has three core languages: Python, Golang, and a PHP monolith. There are shared base libraries for each language. Different services use different repositories. Configuration management of the base layer is done with masterless Salt.

Lyft has an infrastructure team that enables others, not operates; the company does not have dedicated Operations or SRE teams.

For monitoring, statsd is used as the collection protocol; a modified version of Uber’s statsrelay and statsite are used to process and pipe the data into time series databases. The default sample period is 60 seconds; per-host data is processed locally, with central aggregation done at the service level.
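
For readers who have not seen it, the statsd wire format is a plain text line, name:value|type with an optional |@sample_rate suffix, usually sent over UDP. A minimal sketch follows; the metric names, host, and port are placeholders and have nothing to do with Lyft's actual setup.

```python
import socket
from typing import Optional


def statsd_line(name: str, value, metric_type: str = "c",
                sample_rate: Optional[float] = None) -> str:
    """Format one statsd datagram: c = counter, g = gauge, ms = timer."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate is not None and sample_rate < 1:
        line += f"|@{sample_rate}"
    return line


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; host and port are placeholder values."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

Calling send_metric(statsd_line("example.requests", 1)) would emit a counter increment to a local statsd-compatible agent.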

Critical to Calm: Debugging Distributed Systems—Ian Bennett

Ian offered a framework of two processes to calmly debug complex systems, with some examples of actual issues encountered at Twitter.

The first process is the measure -> analyze -> adjust -> verify loop. The loop is done for a single change, not multiple, because inferring which of the multiple changes is causing the needle to move is much more difficult than trying one thing at a time.

The second process is the “onion” of investigation, with progressively deeper levels of effort and involvement in the system being worked on: metrics, tuning, tracing and logs analysis, profiling, and custom instrumentation/changing the service code to test a theory.

The onion should be descended one layer at a time: test a theory on a particular layer, verify. If no further progress is being made, move down a layer. The same process can and should be followed for critical issues (“the pager’s angry”).

Common issues encountered: logging; auto-boxing in Java; caching; concurrency issues; and expensive RPC calls.

Ian calls out Finagle for being great for composability and debuggability: your service is a function, easy to add filters between functions and inject tracing/investigation code there.
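
The "service is a function" idea can be sketched in a few lines. This is a Python analogy, not Finagle's actual Scala API: the type aliases, and_then, and the timing filter are hypothetical names for illustration, and real Finagle services return futures.

```python
import time
from typing import Callable, Dict, List

Service = Callable[[Dict], Dict]          # request -> response
Filter = Callable[[Dict, Service], Dict]  # (request, next service) -> response


def and_then(flt: Filter, service: Service) -> Service:
    """Compose a filter with a service, yielding a new service."""
    def wrapped(request: Dict) -> Dict:
        return flt(request, service)
    return wrapped


def make_timing_filter(timings: List[float]) -> Filter:
    """Build a filter that records how long the wrapped service took."""
    def timing_filter(request: Dict, service: Service) -> Dict:
        start = time.perf_counter()
        try:
            return service(request)
        finally:
            timings.append(time.perf_counter() - start)
    return timing_filter


def echo_service(request: Dict) -> Dict:
    """Stand-in for real business logic."""
    return {"status": 200, "body": request.get("body", "")}
```

Because the composed result is again a service, investigation code can be slotted in between any two layers without touching the service itself.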

Managing Logs with a Serverless Cloud–Paul Fisher

Lyft manages their logs using a mostly serverless pipeline in AWS, which is a rather remarkable setup, and Paul explained how it works.

Logs are piped into Heka (end-of-lifed by Mozilla, but still works fine), collected by Kinesis Firehose as JSON. Firehose flushes into S3, which is the permanent archive and is fairly cheap (2-3 cents/GB). 10 GB/minute ingested with about 20 sec lag. Ingestion into ESS is done on Lambda. Since CPU for Lambda can be scaled, Lyft can pay more for faster execution and get better throughput, resulting in a constant cost-per-item. Failures are written into SQS and back-filled by successful jobs. Individual services are rate-limited in DynamoDB. Logs are searched with Kibana. A reverse proxy in the infrastructure authenticates requests before they hit ESS (which has no VPC support). ElastAlert, written by Yelp, is used for alerting and for running regexes on logs; the entire pipeline completes in under one minute.
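
To make the ingestion step a little more concrete, here is a sketch of turning a batch of JSON log records into a request body for the Elasticsearch bulk API, which expects newline-delimited JSON with an action line before each document. The daily index naming and the "ts" field are assumptions for the example, not details from the talk; older Elasticsearch versions also expect a document type in the action line or URL.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Iterable


def bulk_body(records: Iterable[Dict], index_prefix: str = "logs") -> str:
    """Build a newline-delimited bulk request body from log records.

    Each record needs a "ts" field with a unix timestamp (an assumption).
    """
    lines = []
    for record in records:
        day = datetime.fromtimestamp(record["ts"], tz=timezone.utc).strftime("%Y.%m.%d")
        # Action line: index this document into a time-based index.
        lines.append(json.dumps({"index": {"_index": f"{index_prefix}-{day}"}}))
        lines.append(json.dumps(record, sort_keys=True))
    # The bulk API requires the body to end with a newline.
    return ("\n".join(lines) + "\n") if lines else ""
```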

Lyft employs structured logging with a message and KV API. The reason is that continuously running regular expressions to extract KV from logs to generate real-time metrics is quite expensive. Max’s notes: the API has no printf-like formatting into human-readable text. Would that be useful?
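
A message-plus-KV logging call might look something like the sketch below. The function name and field names are hypothetical; the point is only that a consumer can read a field straight out of the parsed record instead of running a regular expression over free-form text.

```python
import json
import time


def format_event(message: str, **fields) -> str:
    """Render one structured log line as JSON: a message plus key/value pairs."""
    record = {"ts": time.time(), "message": message}
    record.update(fields)
    return json.dumps(record, sort_keys=True)


def extract_metric(line: str, key: str):
    """Pull a numeric field out of a structured line; no regex needed."""
    return json.loads(line).get(key)
```

For example, format_event("request finished", route="/rides", duration_ms=42) yields a line from which extract_metric(line, "duration_ms") reads the value directly.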

Using the cost-efficient solutions for different purposes:

stats for metrics
tracing for debugging/sampling
logs for incident triage & debugging. Do not log anything when everything is fine.

Some notes on ESS scaling: the original ESS cluster had 10 r3.8x data nodes, 3 m4.l masters. 10K JSON docs/average. However, managed ESS did not scale well and caused support issues; the optimal solution required specialized clusters/node types for different types of data: hot/warm/cold nodes for ingest, rollover, and forcemerge.

See also:

Managing Elasticsearch time-based indices efficiently

Distributed Tracing at Uber Scale—Yuri Shkuro

Yuri told the story of rolling out tracing at a large company. How large? Uber has between 2 and 3 thousand microservices. “At some point we had more services than engineers, so we had to hire our way out”.

Adding tracing throughout the system is hard, expensive, and boring. Tracing is both a newer and a harder problem than metrics/logging, specifically because, to be useful, the tracing context needs to be propagated through every microservice along the way, from the input to the output of each. Yuri called this the in-process context propagation problem. Thread-local storage can be a solution for some languages, but Golang has nothing implicit that can be used, so languages like it need the context to be passed explicitly. OpenTracing to the rescue (the other option is vendors that monkeypatch APIs to add “transparent” support, which can be expensive for vendors and therefore is expensive for customers).
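
To show what explicit context propagation means in practice, here is a toy sketch: the handler extracts the trace context from incoming headers, passes it as an argument to everything it calls, and injects it into outgoing headers. This is not the OpenTracing API; the class, the functions, and the x-example-* header names are all made up for illustration.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TraceContext:
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None


def start_span(parent: Optional[TraceContext] = None) -> TraceContext:
    """Start a root span, or a child span that shares the parent's trace id."""
    span_id = uuid.uuid4().hex[:16]
    if parent is None:
        return TraceContext(trace_id=uuid.uuid4().hex, span_id=span_id)
    return TraceContext(parent.trace_id, span_id, parent.span_id)


def inject(ctx: TraceContext, headers: Dict[str, str]) -> Dict[str, str]:
    """Write the context into outgoing headers (hypothetical header names)."""
    headers["x-example-trace-id"] = ctx.trace_id
    headers["x-example-span-id"] = ctx.span_id
    return headers


def extract(headers: Dict[str, str]) -> Optional[TraceContext]:
    """Read the caller's context from incoming headers, if present."""
    if "x-example-trace-id" not in headers:
        return None
    return TraceContext(headers["x-example-trace-id"], headers["x-example-span-id"])


def call_downstream(ctx: TraceContext) -> Dict[str, str]:
    """Return the headers an outgoing call would carry."""
    return inject(start_span(ctx), {})


def handle_request(headers: Dict[str, str]) -> Dict[str, str]:
    """The context is an explicit argument at every step; nothing is implicit."""
    ctx = start_span(extract(headers))
    return call_downstream(ctx)
```

If any service on the path forgets to pass the context along, the trace id is lost and the chain breaks at that point.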

How do we convince the organization to instrument their microservices? It’s the 2017 travelling salesman problem, and like the original it is NP-hard.

The benefits of tracing: transaction monitoring, performance/latency optimization, root cause analysis, service dependency analysis. For Uber, service dependency analysis is the most important use case. Who are the dependencies, what is the workflow–attributing traffic coming from a service’s consumers to actual business use cases. Another feature not directly related to tracing but very useful: propagation of context, also called “baggage”. Using baggage, shared services can work in a multi-tenancy model (test/prod/etc.), carry HTTP headers, auth tokens, do request-specific routing. A good “baggage” system is designed in a way that services do not need to be aware of its structure.
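
One way to read "services do not need to be aware of its structure" is that a service forwards every baggage item unchanged and only looks up the keys it cares about. A small sketch under that assumption follows; the header prefix and function names are hypothetical and are not Jaeger's wire format.

```python
from typing import Dict, Optional

BAGGAGE_PREFIX = "x-example-baggage-"  # hypothetical prefix


def extract_baggage(incoming_headers: Dict[str, str]) -> Dict[str, str]:
    """Collect all baggage headers without interpreting their contents."""
    return {k.lower(): v for k, v in incoming_headers.items()
            if k.lower().startswith(BAGGAGE_PREFIX)}


def outgoing_headers(incoming_headers: Dict[str, str],
                     own_headers: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Forward every baggage item as-is on calls to downstream services."""
    headers = dict(own_headers or {})
    headers.update(extract_baggage(incoming_headers))
    return headers


def baggage_item(headers: Dict[str, str], key: str,
                 default: Optional[str] = None) -> Optional[str]:
    """Look up a single item, for the few services that need one."""
    return extract_baggage(headers).get(BAGGAGE_PREFIX + key, default)
```

A shared service could, for instance, read a "tenant" item to route test traffic separately while every other service just passes it through.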

To sum up, tracing allows:

performance tests
capacity/cost accounting and attribution to business goals
baggage propagation

All of these are “carrots”, and eventually (with adoption, and because a non-compliant service breaks the tracing chain) these become “sticks” forcing stragglers to catch up.

Adoption is measured not just by receiving some tracing metrics from the service, but also by accounting for correct propagation (for example: there is a caller but no spans are emitted by the service; there are no routing logs, etc.)
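
The "there is a caller but no spans are emitted by the service" check can be sketched as a set difference over collected spans. The span shape below (a "service" field and an optional "peer" field naming the callee) is an assumption for the example, not Jaeger's data model.

```python
from typing import Dict, Iterable, List


def broken_links(spans: Iterable[Dict]) -> List[str]:
    """Services that are called by someone but never emit spans themselves."""
    spans = list(spans)
    emitting = {s["service"] for s in spans}
    called = {s["peer"] for s in spans if s.get("peer")}
    return sorted(called - emitting)
```

Anything this returns is either not instrumented or is dropping the context it receives.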

The specific tracing implementation is Jaeger, developed and open-sourced by Uber. It is Zipkin-compatible.

How many people use tracing? @opentracing did a survey and got 42% of people/companies, but it is biased toward people who are already interested… A quick show of hands– approximately 10-15% of attendees at Monitorama.

See also:

Evolving distributed tracing at Uber

Kubernetes-defined Monitoring—Gianluca Borello

In addition to being a comprehensive demonstration of SysDig’s features, the talk by Gianluca raised a number of interesting points about the role of an SRE team, particularly a constrained one. It also talked about the features a monitoring tool should have to be effective in a container environment, and about monitoring-as-code in Kubernetes metadata.

When an SRE team is constrained (and in general, for optimal SRE scaling):

the team does not do instrumentation. Monitoring tools should do this.
the team does not help in producing metrics, unless they are deeply custom. Either the monitoring tool does this, or the service owners should be able to easily add their own.
all metrics must be tagged without SRE involvement, ideally by using some metadata that is interpreted by metrics capturing tooling.
there should be an easy way to direct metrics, alerts, dashboards to relevant teams. Metadata helps here as well.

The same great things containers allow also make containers more opaque. A container aware monitoring solution can help by reaching into the container from the host (the other option is explicitly configured monitoring sidecars) and tagging/grouping data both by container and by higher order primitives, such as Kubernetes pods, automatically.

SysDig used the free-form Kubernetes job’s metadata.annotations field to specify monitoring as code: define team members, dashboards, and alert specifications.
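
A sketch of what reading monitoring-as-code out of annotations could look like. Kubernetes annotation values are strings, so structured values are assumed here to be JSON-encoded; the monitoring.example.com/ key prefix and the field names are invented for the example and are not SysDig's actual keys.

```python
import json
from typing import Any, Dict

PREFIX = "monitoring.example.com/"  # hypothetical annotation namespace


def monitoring_spec(manifest: Dict[str, Any]) -> Dict[str, Any]:
    """Collect monitoring settings from a manifest's metadata.annotations."""
    annotations = (manifest.get("metadata") or {}).get("annotations") or {}
    spec = {}
    for key, value in annotations.items():
        if not key.startswith(PREFIX):
            continue  # ignore annotations owned by other tools
        name = key[len(PREFIX):]
        try:
            spec[name] = json.loads(value)  # structured value (list, object)
        except ValueError:
            spec[name] = value              # plain string such as a team name
    return spec
```

The monitoring tool can then create dashboards and alerts from the returned spec whenever the workload is deployed, with no separate SRE step.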

Closing

Some metrics of the event:

614 attendees, 582 on waiting list this year.
1146 people on the conference Slack.
50% women speakers and aiming for 75% next year.
Up to 800 clients on the conference WiFi.
5,000 live stream viewers.
1 underground fire in downtown Portland.

Other Summaries and Reports