I wrote this report for internal consumption shortly after the conference. As Monitorama
2018 approaches, I thought these notes could be useful for a wider audience to revisit
the common themes and interesting questions from the previous year and to see what has been
done to address them.
I work at PagerDuty. The opinions and comments below are my own and do not reflect PagerDuty’s position.
If you enjoyed these notes, one way to read them sooner is to work with me :)
This conference was recommended to me by multiple coworkers, and I was not
disappointed. On-site, there were many interesting talks and a bunch of hallway track
conversations, while the Monitorama Slack might be the most active one I have
seen for a conference of this size. The organizers demonstrated that they walk
the incident management talk by cleanly failing over to an alternate venue
overnight as an electrical vault fire cut power to a large chunk of downtown Portland,
including the original venue, for a full day starting the evening of the first day
of the conference.
The first clear trend of Monitorama was socks. Every sponsor had
a design to give away, and socks are an easier fit into a larger variety of
wardrobes than T-shirts. I thought it was a one-time special but later learned
that socks have been a Monitorama tradition since 2013.
The second clear trend was frustration with overly complex code instrumentation and
with juggling multiple, often poorly integrated, systems for capturing and analyzing specific
types of monitoring data. This trend manifested in two ways: a push away from free-form text
logging toward structured logging, which simplifies analysis and the extraction of metrics,
mentioned by Liles, Anctil, Majors, and Fisher; and a more ambitious argument that logs,
metrics, and traces are essentially the same basic unit of monitoring, with the differences
between them driven mainly by implementation constraints imposed by analysis
systems to achieve reasonable response times. This view was expressed by Mukerjee and Majors.
Continuing the discussion around metrics, Bryan Liles made a case for a simple
change: reporting rich health checks instead of just a status code. Both Majors
and Sigelman pointed out that metrics/time series data are just the beginning of
an investigation: they can indicate a problem but rarely point directly at the root cause.
With the increase in both the number of companies on a microservices architecture
and the number of microservices in each, the need for an infrastructure-wide tracing
solution, both to debug issues and to simply understand the constantly changing
interconnections between services, was clear. There
were multiple vendors offering solutions. Shkuro told the story of Uber’s
tracing system in detail, and Sigelman extended the basic tracing model with resource
tags to quickly find causes of resource starvation.
Two speakers talked about the need for a clearer separation
between monitoring and alerting: both at the level of systems and models
(Nichols) and at the level of the people maintaining, building, and improving them (Rapoport).
There were a good number of talks about both the monitoring and general infrastructure
of various companies; these are always helpful for collecting opinions on possible
approaches, gauging the popularity of tools, and adopting good practices. I thought Fisher’s
talk on a mostly serverless log pipeline at Lyft was the most radical architecture of the
bunch–could serverless be the future of monitoring infrastructure?
Finally, the on-call community got a lot of love with talks on specific debugging tools
from Evans and Creager and more general mental frameworks for debugging and major
incident handling by Bennett and Reed. The on-call section was capped by Goldfuss’s
powerful reminder that on-call is a necessary overhead, not a goal in itself, and
we should be striving to minimize it instead of glorifying it.
Individual talks with closed captioning are now out: https://vimeo.com/channels/1255299.
In my opinion, these talks were the most thought-provoking of the bunch. If you
do not have the time to read the entire report (and I do not blame you–it weighs
in at close to 7,000 words), pay attention to these three.
Amy Nguyen’s “UX Design and Education for Effective Monitoring Tools”. An engineer,
not a UX designer, Amy dove into a topic that was not covered by any other talk
at the conference but is crucial to gaining acceptance and effective use of any
tool, bought or custom developed, by the rest of the organization.
Yuri Shkuro’s “Distributed Tracing at Uber Scale”. In addition to describing the
tools and the capabilities, Yuri’s talk offered a blueprint and a set of “carrots”
and “sticks” to convince an organization to invest the effort into across-the-board
adoption of tracing.
The first part of Aditya Mukerjee’s talk, on the convergence of observability primitives
and a single system that should transform them as appropriate. Then follow it with a more
forceful exploration of the same concept in Charity Majors’s “Monitoring: A Post
Mortem”, a thought experiment into what an ideal observability system could look
like if the existing constraints on the way monitoring solutions are built were removed.
And yes, I cheated a bit and there are four must-see talks in these three bullet items.
These are in schedule order.
The data science field has, over the years, developed a rather impressive and
highly coherent toolset for slicing, dicing, and visualizing large amounts of
data. Coincidentally, this is exactly what monitoring folks like doing when not
fighting fires–namely, looking for trends and insights in the data collected
by their monitoring systems.
The talk claimed that the toolset of data science is superior to what is currently
available in the monitoring field (especially in the open-source world) and that,
as more and more datasets cross the boundary from “merely large” to “data science
scale”, languages and tools explicitly designed for analyzing these datasets
will supplant existing tools. For example, John expects dplyr to replace SQL
in the analytics world.
John demonstrated some of the components of the data science toolchain:
the R language and its common data interchange
format–the data frame–a row/column table that turns out to be suitable
for a very large class of problems. He also covered the tidyverse package collection
and some of the packages included in it, like ggplot2 (visualization, aka nice charts)
and dplyr (a data manipulation tool). Better, more graphical tools would be good
for increasing R’s reach; Shiny is one example, and
John wanted to inspire tool makers to focus on the R ecosystem.
Martyrs on Film: Learning to hate the #oncallselfie—Alice Goldfuss
Alice created the #oncallselfie, and the idea has caught on. PagerDuty
added support for oncallselfies
in their app, and a fair number of people share their experiences of being
on-call. The talk, while acknowledging the benefits of on-call, cautions against
glorifying the on-call culture and argues for cutting down the interruptive load
of incidents by ruthless winnowing of events that are allowed to page on-call
engineers and adding automation that removes humans from the loop.
Putting developers on-call has clear benefits. Developers are usually best
equipped to solve the problems–both immediate ones and root causes–and time
to recovery can improve by a lot: for example, Sean Kilgore,
a Principal Engineer at SendGrid, wrote in Monitorama Slack that “our MTTR
improved by about 50% in the 3 months after we put our devs on call” in an organization
of about 100 engineers.
It can also be exciting to be on call
and to be “the parent of these little server babies”; in many organizations,
Operations has a military culture that glorifies the hardship and the action
of being in the middle of an incident. However, in the movies, while action is
happening the plot is stopped. Plot development happens in between the action
scenes. Similarly, working on-call pages prevents you from doing higher
level engineering work.
“Pages stop the plot of your career, and sometimes your life.”
How bad can it be? In VictorOps’ “State of On Call” report,
17% of respondents got 100+ pages per week during their worst on-calls. There are
other red flags too, like a team not keeping up (too few people owning too much);
applying bandaids–changing thresholds, snoozing, waiting out services that are
unstable; and no organizational visibility. If a team is struggling but the company
does not know about it or does not care, the team will not get anywhere. If this
is happening to ops in a non-DevOps company and the developers won’t go on call,
Alice’s advice is to “exit (that team, department, company), because your career
and life are more important”.
To minimize the on-call load, each alert should be actionable. In Alice’s
definition, all four of the following conditions must hold:
- Something has broken
- Customers are affected
- The on-call is the best person to fix it
- This must be fixed immediately
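Alice’s four-condition test maps naturally to code. A minimal sketch (the event fields and function name are mine, not from the talk) of gating a page on all four conditions:

```python
from dataclasses import dataclass

# Field and function names here are illustrative, not from the talk.
@dataclass
class Event:
    broken: bool                 # something has broken
    customer_impact: bool        # customers are affected
    oncall_is_best_fixer: bool   # the on-call is the best person to fix it
    urgent: bool                 # this must be fixed immediately

def should_page(e: Event) -> bool:
    """Page a human only when all four conditions hold."""
    return (e.broken and e.customer_impact
            and e.oncall_is_best_fixer and e.urgent)
```

Anything failing even one condition would become a ticket or a log entry instead of a page.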
The final point is automation. Humans cannot satisfy short SLAs, especially at
night: they need to wake up, understand the problem, and so on. If an SLA has many
nines, automation must make most availability/failover decisions; otherwise
that SLA is a work of fiction.
Monitoring in the Enterprise—Bryan Liles
A few talks at this year’s Monitorama started with a statement along the
lines of “big enterprise has unlimited money, they can just throw money at any
problem; this talk is for the rest of us”. Bryan’s talk was interesting precisely
because he comes from “big enterprise” (Capital One) and could talk about monitoring
challenges at that scale and in a highly regulated environment.
At Capital One, the monitoring team numbers about 30-40 hands out of 600 engineers.
“What tools do we run? All of them. Why not?”
Since money is typically not the constraint in the enterprise, it is possible to set
up different tools and choose the “best of breed”. The selection of tools is
iterative (“pick tools; monitor; bicker about tool choices; back to step 1”).
Communication and shared tooling are particularly important: do the developers even
have access to the monitoring team’s tools?
Getting the most information out of services is a big factor as well. Two items
Bryan singled out were structured logging and rich health checks: “it doesn’t
cost too much to output richer status data in addition to a yes/no response code”.
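As a sketch of what a rich health check could look like (the endpoint shape and check names are hypothetical, not from Bryan’s talk), a service might return structured detail alongside the yes/no answer:

```python
import json
import time

# Hypothetical dependency checks; a real service would probe its database
# connection, queue, downstream APIs, and so on.
def check_database() -> bool:
    return True

def check_queue() -> bool:
    return True

def health() -> str:
    """A rich health check: the yes/no answer plus structured detail."""
    checks = {"database": check_database(), "queue": check_queue()}
    return json.dumps({
        "status": "ok" if all(checks.values()) else "degraded",
        "checks": checks,
        "timestamp": time.time(),
    })
```

The extra fields cost little to emit but tell the reader which dependency is unhealthy without a second query.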
Max’s comments: I don’t have any notes on the flip side of having too many tools: how
to determine which tool is most useful for a particular problem, whether data
goes into all tools or there are silos of data restricted to particular tools, and
how that knowledge is shared. While money might not be a constraint, cognitive
overhead exists in organizations of all sizes, and I would have liked to
hear more about conscious selection and deliberate decisions to reduce the
core set of tools organizations choose to use.
Yo Dawg: Monitoring Monitoring Systems at Netflix—Roy Rapoport
Roy’s talk was about the monitoring setup at Netflix, with particular focus on how
to monitor the health of the monitoring system itself. He made a distinction
between monitoring (a system for storing and querying specialized types of data,
such as time-series data) and alerting (a system that takes in data and outputs
decisions and opinions).
Netflix stores full-resolution metrics for 6 hours, which are then S3+map/reduced
into 3-day buckets and then further into 18-day buckets. The 18-day tier drops node
information from the data, since the average system lifetime at Netflix is under 3 days.
Usage of the system grew by 3 orders of magnitude within 3 years of launch. Tags
need to be carefully managed; the typical grouping by sets of tags can result in huge
cardinality of the tag space if dimensions with large numbers of values are used.
Some examples: execution time in ms (30k tag values), account ID (10^9 tag values),
IP address (2^32 tag values).
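The blowup is easy to see: the worst-case number of time series for a metric is the product of the distinct values in each tag dimension. A small illustration (function and dimension names are mine; the account-ID size comes from the talk’s examples):

```python
from math import prod

# Worst-case series count for a metric is the product of the number of
# distinct values in each tag dimension attached to it.
def series_count(dimension_sizes):
    return prod(dimension_sizes.values())

safe = series_count({"region": 4, "status": 5})  # 20 series
risky = series_count({"region": 4, "status": 5, "account_id": 10**9})
# 'risky' is 20 billion potential series, unusable in practice.
```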
The monitoring team (18 people, developers and operators) works closely with
the top 10 internal Netflix applications by usage. Combined, they account for
50-60% of total metrics; the team does not worry about the rest of the
applications since they do not have a systemic impact.
Our Many Monsters—Megan Anctil
Megan told the story of the Visibility Operations team at Slack. The team went
with open-source solutions instead of as-a-service vendors and estimates that the
savings are approximately $1m/month. The team has 5 full-time engineers.
For metrics, Slack uses a modified Graphite stack. 90 TB of data for 30 days.
Custom interface to explore (show all metrics for a single host). No tags.
For logging, the setup is Logstash with Sieve, an in-house replacement for
ElastAlert. 250 TB of data for 2 weeks, on 450 nodes. Parsing unstructured data
(typically with regexes) is very expensive; start with structured data.
For alerting, Icinga with xinetd and nsca for active checks. 140k service checks
on 6300 nodes.
Tracing Production Services at Stripe—Aditya Mukerjee
I understood this talk as containing two distinct parts. The first part talked
about the future where observability primitives that are currently
distinct, mostly for performance reasons of storage and querying (metrics, logs,
and traces), are emitted together and intelligently processed by a single system
into the most appropriate form for the application. Aditya calls the
unified form the “Standard Sensor Format”.
A transition to this future would be driven both by the pain of instrumenting all
of those primitives separately in today’s service code (emitting metrics,
logs, and traces) and by the difficulty of investigating an issue based
solely on one of these types of primitives (for example, metrics can show that
something is wrong, but good luck figuring out the cause from the metrics alone). Later
in the day, Charity Majors put forth many of the same arguments, and I think it
unlikely that placing these two talks, echoing each other, nearby in the schedule
was a mere coincidence.
The second part of the talk was the discussion of Veneur,
a replacement for DataDog’s collector daemon that can perform distributed
aggregation of collected metrics before forwarding the data. The talk went into
the details of how the aggregation is distributed across machines, and into the
computation of alternatives to medians (which cannot be gathered across hosts in
a statistically valid way), such as t-digests.
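To illustrate why mergeable summaries matter, here is a toy stand-in using fixed latency buckets rather than the t-digests Veneur actually uses: per-host histograms can simply be added together, and an approximate percentile read off the merged result, which is impossible with per-host medians.

```python
from collections import Counter

# Medians computed on individual hosts cannot be combined into a valid
# global median, but mergeable summaries can be added together. Bucket
# values are upper bounds in milliseconds, chosen for illustration.
BUCKETS = [1, 5, 10, 50, 100, 500, 1000]

def summarize(latencies_ms):
    """Build a per-host histogram of latencies."""
    h = Counter()
    for v in latencies_ms:
        h[min((b for b in BUCKETS if v <= b), default=BUCKETS[-1])] += 1
    return h

def merge(*histograms):
    """Combine per-host histograms into a global one: plain addition."""
    total = Counter()
    for h in histograms:
        total.update(h)
    return total

def approx_percentile(h, q):
    """Approximate the q-th quantile (0 < q <= 1) from a merged histogram."""
    target = q * sum(h.values())
    seen = 0
    for b in BUCKETS:
        seen += h[b]
        if seen >= target:
            return b
```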
Julia walked the audience through using several common Linux investigation and
performance monitoring tools. Her approach to debugging, based on these tools,
can be language-agnostic and does not depend on adding logs (“debug-by-printf”);
instead, the OS-level tools can pinpoint the issue by looking at what the kernel
is doing. Even without knowing the details of the program being debugged, this
additional information can give a developer the nudge they need to arrive at the
cause of the issue. Similarly, Julia showed that deep knowledge of the kernel or
operating system internals is not a prerequisite to using these tools; often,
the output of the tools is self-descriptive enough to make a good guess at what
is going on.
Many programmers see debugging as a chore, a grunt work that is the necessary
complement to the actually interesting part of writing new code. In a complete
opposite of that world view, Julia’s enthusiasm for investigating hard problems
and sharing the relevant knowledge is electrifying.
Instrumenting Devices for Modeling the Internet—Guy Cirino
Netflix needs to see the full distribution of connection quality on the Internet
to optimize their services. Sending too many metrics from the client can result
in too much data to process; Guy does not want to take averages, and does not want
to miss data through sampling.
Instead, Netflix chose to aggregate statistics at the client and send out aggregated
data once in a while. The preferred data structure for this is a histogram, but
with non-linear buckets: when there are big outliers, linear buckets are wasteful
on the long end.
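A sketch of the idea, with illustrative exponential bucket boundaries (not Netflix’s actual scheme): resolution stays fine near the common values, while rare large outliers land cheaply in wide high-end buckets.

```python
# Exponentially spaced bucket boundaries: fine resolution where most
# values live, cheap coverage of big outliers at the long end.
def make_bounds(start=1.0, factor=2.0, count=12):
    bounds, b = [], start
    for _ in range(count):
        bounds.append(b)
        b *= factor
    return bounds  # 1, 2, 4, ..., 2048

def bucket_index(value, bounds):
    for i, b in enumerate(bounds):
        if value <= b:
            return i
    return len(bounds)  # overflow bucket for extreme outliers
```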
Monitoring: A Post Mortem—Charity Majors
The core argument of Charity’s talk was that the future of complex systems lies in
observability, not monitoring. Namely, the current state-of-the-art
monitoring tools focus on answering known questions about the state of the system;
when configuring existing monitoring tools we think what might go wrong and set
up metrics, dashboards, and alerts to catch these problems. However, many difficult engineering
problems start from “unknown unknowns”, issues that we did not anticipate–and
therefore our chosen metrics and dashboards might not be helpful for diagnosing
such scenarios, while tools that allow general observability and open-ended, deep
exploration of system state do not currently exist.
With that argument in hand, Charity took existing monitoring systems to task for
insufficient flexibility and destruction of detail. In
particular, she blasted fixed pre-indexing (“anything that relies on schema or
index that tries to guess what you’ll need a year from now is broken”), dashboards
(“I can’t pregenerate a dashboard for every possible condition”), quality of data
displayed on them (“A green dashboard is a lie, since any system is never
completely up”), and aggregation in general (“aggregates destroy detail… outliers
are the whole damn point”).
“If a user thinks your system is down, the more you argue with them, the more credibility you lose.”
What does the future look like? It starts with data. Charity envisions many
structured, unaggregated events collected in response to every single user request.
A system for investigating the data should be able to perform both statistical/data
analysis and show specific extreme outliers, along arbitrary dimensions and not predefined
ones (so, for example, metrics tagging would not be sufficient). The data should
be collected from everywhere, regardless of ownership (software written by you,
open source software, perhaps service vendors like cloud providers) to enable
correlation analysis across arbitrary systems related to the application.
Max’s comments: some of the reactions to the talk I heard focused on the
cost; a system like this “gets pretty expensive pretty fast” and could be much
slower than existing specialized metrics/logs/traces monitoring systems that achieve
speed by doing a lot of computation ahead of time along predetermined dimensions.
However, I do think that monitoring should be regularly revisited as capabilities
develop–for example, we can store way more data at the same cost (average cost
per gigabyte of spinning rust was $1.24 in 2005 and $0.02 a decade later) but many
infrastructure and OS services still sparingly log single, unstructured text lines.
One thing I did not agree with is that focusing on specific users (“it’s only the
experience of a single user that matters”) or unique outliers is what engineers
should be doing, for the simple reason of scale. The ratio of users to engineers
at WhatsApp is 18,000,000:1.
At this scale, it is practically impossible for engineers to look at unique, one-off
problems; prioritization would ensure they only ever have time to look at issues
that impact a significant percentage of their user base. This has been the case
for many massive-scale services, like Facebook and GMail, whose support for problems
experienced by individual users is famously difficult to obtain. However, this
does not refute the thesis that to resolve a widespread issue one
might have to look in minute detail at one individual case of the problem.
Charity also mentioned that operational skills are becoming expected, “table stakes”
for software engineers; however, that needs to be recognized on the organizational
level as well by, for example, not promoting to a senior position an engineer
who does not know how to run their services.
The Vasa Redux—Pete Cheslock
The talk explored the story of the Vasa and
similarities between the process of the Vasa’s construction and some practices of
the software industry.
Anomalies != Alerts—Elizabeth Nichols
Elizabeth pointed out that, following the basic principles of probability, anomalies
are actually very plentiful given a sufficient amount of data. For example, if a
metric is sampled once per minute and there are 50,000 metrics on a normal distribution,
one can reasonably expect about 5,000 four-sigma (standard deviation) events per day.
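A quick back-of-the-envelope check of that estimate (assuming independent, normally distributed samples, which is a simplification) lands close to the talk’s number:

```python
import math

# Expected number of four-sigma samples per day, assuming each of the
# 50,000 metrics is normally distributed and sampled once per minute.
metrics = 50_000
samples_per_day = 24 * 60
p_four_sigma = math.erfc(4 / math.sqrt(2))  # two-sided tail, ~6.3e-5
expected = metrics * samples_per_day * p_four_sigma  # roughly 4,600
```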
We need more context to choose the outliers that actually matter; one way to
provide context would be to build a “semantic model” of the system being monitored.
It is becoming common to provide feedback from humans reacting to alerts regarding
the quality of alerts, and that data can feed back into the model.
Distributed Tracing—Ben Sigelman
Ben claimed tracing is becoming more indispensable than ever, as the number of
possible root causes for problems surfaced in metrics that describe user-visible
symptoms grows with the number of cooperating distributed systems involved in
the transaction. If monitoring is about telling stories, then a microservices
architecture has many storytellers, but we still want to get to the “why” part
as fast as possible.
Classic tracing shows where a transaction
spends most of its time, but not why. Ben hypothesised that contention for a
resource is the cause of most latency issues and proposed assigning a contention
ID to each resource and adding all CIDs encountered during a transaction to
span data collected by tracing. Then, for a highly contested resource, it becomes
possible to backtrace and aggregate to find out where the calls blocked on
the resource originate.
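A toy sketch of the proposal (span shape and field names are mine, not Ben’s): if each span records the contention IDs of the resources it blocked on, finding where the load on a hot resource comes from is just a filter and a group-by over the collected spans.

```python
from collections import Counter

# Hypothetical spans, each recording the contention IDs (CIDs) of
# resources the operation blocked on. For a contested resource, group
# the blocked spans by originating service to see where load comes from.
spans = [
    {"service": "checkout", "cids": {"db-users-lock"}},
    {"service": "checkout", "cids": {"db-users-lock"}},
    {"service": "search",   "cids": {"cache-shard-3"}},
    {"service": "billing",  "cids": {"db-users-lock"}},
]

def blockers(spans, cid):
    """Count blocked spans per originating service for one CID."""
    return Counter(s["service"] for s in spans if cid in s["cids"])
```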
Ben advocated for a standard tracing format (OpenTracing) and said that a lot of
the tracing data might be discarded unnecessarily; out of the possible resources
conserved by discarding or sampling the data (CPU, network, and storage) only
storage has a significant cost. If other resources are not constrained, sampling
“in process” becomes unnecessary.
Getting metrics at Firefox—Laura Thomson and Rebecca Weiss
Obtaining detailed usage data from an open-source project committed to protecting
privacy is hard. Laura and Rebecca talked about Firefox’s path to faster and better
methods of gathering data.
Firefox used to build a dedicated system to answer each specific question, which
was very slow. Now there is a single collection system that is self-service for
users and prioritizes reliability and consistency over precision of the data.
One of the datasets collected is the Firefox hardware report,
providing data on the capabilities of computers used by Firefox users.
Real-time Packet Analysis at Scale—Douglas Creager
Similar to Guy Cirino’s talk, Douglas also talked about obtaining network related
metrics from edge devices–in this case, for Google’s services. However, and unlike
Guy’s talk, this talk focused on specific methods of collection (tcpdump, libpcap)
and interpretation (tracegraphs, latency/RTT, bufferbloat) of protocol-level metrics.
While collecting headers only does not result in an overwhelming amount of data
(and helps with privacy and not looking at encrypted payloads), it is still not
necessary to gather detailed data on all connections. Google biases sampling toward
customers experiencing issues and captures a lot of detail for sampled connections.
Deriving conclusions from the data is automated, and the results are sent into
the monitoring stack.
Instrumenting the Rest of the Company—Eric Sigler
Eric’s unique perspective is derived from jumping between engineer and manager
roles throughout his career. As an engineer, he is used to answering questions
with data, but managers in meetings do not necessarily use data to justify business
actions. “We have a problem with X, so we will do Y”, but why will Y help
and how will success be measured?
“Without data, you are just a person with an opinion.”
“If we have data, let’s look at data. If all we have are opinions, let’s go with mine.”–Jim Barksdale, former CEO of Netscape (thanks to @marcus in Monitorama Slack)
What are some examples of metrics that can be used to answer business questions?
For example, the time a CI system is waiting to start a job could indicate if there
are enough build agents. The time a pull request remains open can point at
velocity trends. JIRA reports (tickets closed) show the overall behavior of developers,
but do not say why or where the bottlenecks are.
Eric closed his talk with a reminder of a difference between machines and humans: machines
do not care which metrics are measured, but a business could choose a metric that
incentivizes wrong behavior by employees.
Whispers in the Chaos—J. Paul Reed
This talk was very interesting to me, since I work at PagerDuty and incident
management is our bread-and-butter. The talk presents some real-world heuristics
and processes for engineers managing an incident, based on a thorough investigation
of a real incident that happened at Etsy in December 2014. A significant part of the
talk reviews the work done by John Allspaw for his MSc thesis.
According to Paul, engineers use the following heuristics:
- Change. What has changed since the known-good state?
- Go wide. When no relevant changes are found, widen the search for the problem to any potential contributors.
- Convergent searching. Confirm or disqualify a theory by matching signals or symptoms to…
- a specific past diagnosis–a painful incident from the past
- a general and recent diagnosis–a recent incident (“Maybe X happened again”)
Expert engineers, based on their accumulated knowledge, can quickly:
- determine typicality of the incident (is it a familiar problem? How bad is it?)
- make fine discriminations (seeing fine detail others miss)
- use mental simulation (what caused it and does it match what we see?)
- apply higher level rules
How does one develop the expert knowledge?
- Personal experience. Being challenged, “taking the pager” and doing on-call work.
- Directed experience. Tutoring, training, code review, pairing, reading runbooks. Being able to tutor others, which requires deep understanding.
- Training and simulation. Chaos engineering, game days, Failure Fridays.
- Vicarious experiences. Especially bad or good events that become stories.
Monitoring @ CERN—Pedro Andrade
CERN does a lot of hard science which generates large amounts
of data in realtime, so Pedro’s talk on the setup of capturing and processing this
data was quite interesting.
CERN runs 14,000 servers in 2 data centers with tape backup. They use Flume
(both push and pull) and aggregate all data through Kafka. File-based channels
are always used, for persistence and not for capacity reasons (that’s what Kafka
is used for).
Kafka details: the software has been solid. Data is buffered for 12 hours.
Each data source is a topic, 20 partitions, 2 replicas/partition. Processing of data
is via Spark (enrichment, aggregation, correlation - stream; compaction, reports - batch).
All jobs are written in Scala because, in their experience, it is a language
well matched to Spark. Jobs are packaged as Docker containers and run in Marathon
or Chronos over Mesos. Marathon worked well for them; Chronos was not as successful.
The long-term storage hierarchy flows from Kafka into Flume -> HDFS (forever), ESS (1 month),
and InfluxDB (7 days; binned and reduced for older data). Two ElasticSearch instances
are used, separating metrics and logs; a separate index is created per data producer.
In InfluxDB, one instance is allocated to every producer. Data is accessed using Zeppelin, Kibana, and Grafana.
Monitoring infrastructure is running in Openstack VMs, except for HDFS and
InfluxDB which are on physical nodes. Puppet managed configuration.
The Future of On-Call—Bridget Kromhout
On-call experience will not be significantly improved merely by better deployment tooling; investing in overall
system architecture, better observability, and developer/operations culture is a
more rewarding path forward, claimed Bridget in her talk that concluded the second
day of Monitorama.
Some specific points of culture raised were to build trust and informal links
between teams so that they will openly share with each other what is really going
on. Advice was given to the product (or project) managers as well, to not hurry
so much to move on to the next thing just because the current project appears to
work, for some values of “work”; and to avoid hype driven development, to think
deeply about the problem at hand and whether the proposed solution actually would
make it better (to me, this further reinforced the point raised by Eric Sigler about
gathering business metrics and data to inform decisions).
Lightning talks are brief 5-minute presentations. Here are the quick summaries.
Brant Strand: a case for professional ethics and protecting user data for our profession.
“You have to fight for your privacy or you will lose it”.
Logan: Optimizing yourself for learning–how to improve memory? Knowing which
questions to ask, practicing retrieval of previously learned information. Taking
time to praise each other for working hard and solving difficult problems.
Christine Yen: Pre-aggregated metrics are bad at answering new questions. Metrics
overload–sometimes there are too many to grasp/make sense of them all.
Eben Freeman: using Linux perf (probe, record, report) to investigate hard problems.
“When measurement is hard, diagnoses are circumstantial, and weird stuff goes unexplained”–I liked
this statement because it also goes the other way; if things in the infrastructure
are diagnosed with uncertainty and weird stuff is dismissed as “just the way things are”,
perhaps more monitoring is in order.
Joe Gilotti: Iris, an incident management tool developed in house.
Me: DNSmetrics, a tool to monitor DNS service providers.
Tommy Parnell: it was just after my talk so I did not take notes. Sorry!
Scott Nasello: on ChatOps, which is good for onboarding new team members, since
they can observe how the team interacts with each other and with customers. Make
people write a new ChatOps command on day one.
Monitoring in a World Where You Can’t “Fix” Most Of Your Errors—Brandon Burton
Hailing from TravisCI, Brandon was facing a particular monitoring challenge: many
of the issues raised by TravisCI builds do not originate from Travis infrastructure,
instead they are GitHub / Python / RubyGems failures. Monitoring for these issues
has to rely almost exclusively on log output produced by builders and searching
that output for problems.
One question raised in the talk is whether making a stream of text (with potentially
sensitive contents) indexed and searchable increases risk to users, in addition to making monitoring
easier. The architecture flows the log data from RabbitMQ to PostgreSQL to
aggregation and archival; ElasticSearch is used for searching.
TravisCI runs a lot of non-building infrastructure on Heroku and claims they have
fairly few long-lived, “owned” (I guess this means actively managed?) machines.
Amy’s talk focused on users of monitoring systems, talking about the principles
followed when building a custom metrics visualization tool at Pinterest to make it
user-friendly but powerful enough for experts. Many of the tips
and insights provided by Amy generalize easily to any internally built tool, whether
brand new or evolving, and I thought this was one of the more insightful and
broadly applicable talks this year.
No one wants to use monitoring tools; if they do, it means something is wrong.
A good monitoring tool is aimed at both experts and non-experts; not everyone is,
or should be, an expert user at interpreting monitoring data. Amy suggested the
following guidelines to make the system non-expert-friendly:
- Do not overload users with too much information.
- Encode best practices: use domain expertise to determine the most helpful default behavior. For example, make simplifying assumptions that almost always make sense, such as spreading an alert over 10 minutes instead of alerting immediately when a threshold is crossed.
- Make it hard to do the wrong thing.
- Make it easy to try things without fear of breaking stuff. Exploration is important because we don't know in advance what questions people might want to answer with the data.
Some features are appreciated by all classes of users. Performance is always
valued, because everyone wants to reach actionable conclusions faster;
do whatever it takes to make tools fast. For monitoring, this means rolling
up data over longer time ranges and caching data in memory, as Facebook does,
or adding a caching layer like Turn's Splicer.
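The roll-up idea above can be sketched as simple bucket averaging; this is a hypothetical helper of my own, not any particular tool's implementation:

```python
def rollup(points, bucket):
    """Downsample (timestamp, value) points by averaging into fixed-size
    time buckets, so queries over long ranges touch far fewer points."""
    buckets = {}
    for ts, value in points:
        start = ts - ts % bucket  # align timestamp to bucket boundary
        buckets.setdefault(start, []).append(value)
    return sorted((start, sum(vs) / len(vs)) for start, vs in buckets.items())
```

For example, rolling 30-second points up to 60-second buckets halves the number of points a dashboard has to fetch and render.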
Frontend UI tricks can be a quick way to add speed and can be as simple as not
reloading existing data when bounds change, or lazy-loading pieces of content when
they are about to come into view.
Advanced/expert users appreciate some accommodation as well. One way to provide
it is to switch to an advanced UI mode when advanced input (like braces, or commands)
is typed in. Max's comment: an example of a simple/expert UI split can be found in Datadog,
with both a graphical and a free-form query entry UI available via a quick switch on the Monitor screen.
Having a tool that looks good also, perhaps unconsciously, promotes trust in the system.
Invest some effort into a polished user experience.
“I would say Grafana is aesthetically beautiful, and our tool is not,
and that is so important. if the tools are ugly, people trust them less. It’s sad,
but it’s true! if the margins are inconsistent or things look off, the trust is gone and people feel anxious.”
I asked a follow-up question about soliciting initial feedback from customers while
building the tool:
“Good question! I would say release small, incremental changes over time. We
didn’t release all of the things we built all at once–we slowly changed things
we thought were painful, and then waited to see if people commented on it (and
a lot of people sent messages saying “I like this new thing!” and that’s how we
knew we were on the right track)… The benefit of being an in-house developer
is that you can walk right up to your customers, show them your ideas on a
piece of paper, and ask them what they think.”
A few numbers: Pinterest logs 100 TB/day at 2.5 million metrics per second. The
engineering org numbers around 400. They considered Grafana plugins but decided on
a custom tool.
Automating Dashboard Displays with ASAP—Kexin Rong
ASAP is an algorithm for automatic smoothing of noisy data. The main challenge is
removing the noise (short-term fluctuations) while retaining "outlyingness",
or longer-term trends; the statistical measure used for this is "kurtosis".
The algorithm has been implemented in Macrobase, a system for diagnosing anomalies,
and there is a Graphite function for it.
The talk is based on this paper by the presenter
and Peter Bailis.
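As a rough illustration (my own sketch, not the paper's actual algorithm, which searches for the window that minimizes roughness subject to preserving kurtosis), the idea can be approximated by picking the widest moving-average window whose smoothed series still preserves the original kurtosis:

```python
from statistics import mean

def kurtosis(xs):
    """Fourth standardized moment: high values mean spiky, 'outlying' data."""
    m = mean(xs)
    var = mean((x - m) ** 2 for x in xs)
    if var == 0:
        return 0.0
    return mean((x - m) ** 4 for x in xs) / var ** 2

def moving_average(xs, w):
    return [mean(xs[i:i + w]) for i in range(len(xs) - w + 1)]

def asap_like_smooth(xs, max_window=None):
    """Return the most aggressively smoothed series that still keeps
    kurtosis at or above the raw series' kurtosis (ASAP-like heuristic)."""
    max_window = max_window or len(xs) // 10 or 1
    target = kurtosis(xs)
    best = xs
    for w in range(2, max_window + 1):
        smoothed = moving_average(xs, w)
        if kurtosis(smoothed) >= target:
            best = smoothed
    return best
```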
The End of User-Based Monitoring–François Conil
I did not take any notes for this talk, sorry!
Consistency in Monitoring with Microservices at Lyft–Yann Ramin
Yann described Lyft’s infrastructure and monitoring setup. Lyft has three core
languages: Python, Golang, and a PHP monolith. There are shared base libraries
for each language. Different services use different repositories. Configuration
management of the base layer is done with masterless Salt.
Lyft has an infrastructure team that enables others rather than operating for them;
the company does not have dedicated Operations or SRE teams.
For monitoring, statsd is used as the collection protocol; a modified statsd relay
and statsite are used to process and pipe
the data into time series databases. The default sample period is 60 seconds; per-host
data is processed locally, with central aggregation done at the service level.
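For reference, the statsd wire format is tiny: one `name:value|type` datagram per metric over UDP. A minimal client sketch (function names are mine, not Lyft's):

```python
import socket

def format_statsd(metric, value, mtype="c"):
    """statsd wire format: <name>:<value>|<type>, e.g. 'requests:1|c'."""
    return f"{metric}:{value}|{mtype}"

def send_statsd(metric, value, mtype="c", host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a local statsd/statsite daemon."""
    payload = format_statsd(metric, value, mtype).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

The UDP transport is what makes the per-host-then-aggregate topology cheap: emitting a metric never blocks the service.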
Critical to Calm: Debugging Distributed Systems—Ian Bennett
Ian offered a framework of two processes to calmly debug complex systems, with
some examples of actual issues encountered at Twitter.
The first process is the measure -> analyze -> adjust -> verify loop. The loop
is done for a single change, not multiple, because inferring which of the
multiple changes is causing the needle to move is much more difficult than
trying one thing at a time.
The second process is the “onion” of investigation, with progressively deeper
level of effort and involvement in the system being worked on: metrics,
tuning, tracing and logs analysis, profiling, and custom instrumentation/changing
the service code to test a theory.
The onion should be descended one layer at a time: test a theory on a particular
layer, verify. If no further progress is being made, move down a layer. The same
process can and should be followed for critical issues (“the pager’s angry”).
Common issues encountered: logging; auto-boxing in Java; caching; concurrency
issues; and expensive RPC calls.
Ian called out Finagle as great for composability and debuggability: your
service is a function, so it is easy to add filters between functions and inject tracing/investigation code.
Managing Logs with a Serverless Cloud–Paul Fisher
Lyft manages their logs using a mostly serverless pipeline in AWS, which is a rather
remarkable setup, and Paul explained how it works.
Logs are piped into Heka (end-of-lifed by Mozilla, but still works fine) and collected
by Kinesis Firehose as JSON. Firehose flushes into S3, which serves as the permanent archive
and is fairly cheap (2-3 cents/GB). About 10 GB/minute is ingested, with roughly 20 seconds of lag.
Ingestion into ESS is done with Lambda. Since CPU for Lambda scales with memory, Lyft
can pay more for faster execution and get better throughput, resulting in
a constant cost per item. Failures are written into SQS and back-filled by
successful jobs. Individual services are rate-limited in DynamoDB. Logs are
searched with Kibana. A reverse proxy in the infrastructure authenticates requests
before they hit ESS (which has no VPC support). ElastAlert, written by Yelp, handles alerting
and regex runs on logs; the entire pipeline completes in under one minute.
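The constant cost-per-item point follows directly from Lambda's billing model, which is roughly memory times duration. A hedged sketch (the GB-second price and numbers below are illustrative):

```python
def lambda_cost_per_item(gb_second_price, memory_gb, duration_s, items):
    """Lambda bills roughly memory * duration. If doubling memory also
    halves duration (CPU allocation scales with memory), the cost per
    processed item stays constant while throughput doubles."""
    return gb_second_price * memory_gb * duration_s / items
```

So paying for a bigger Lambda buys latency, not extra cost, as long as the workload is CPU-bound.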
Lyft employs structured logging with a message and key/value API. Max's note: there is
no printf-like formatting into human-readable text; would that be useful?
The motivation for structured logging is that continuously running regular expressions
over free-form logs to extract key/values and generate real-time metrics is quite expensive.
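A minimal sketch of what a message-plus-key/value logging API could look like; the function names and fields here are hypothetical, not Lyft's actual API:

```python
import json
import time

def format_event(message, **kv):
    """Emit one JSON object per event, so downstream consumers parse
    fields directly instead of regexing free-form text."""
    record = {"ts": time.time(), "msg": message, **kv}
    return json.dumps(record, sort_keys=True)

def log_event(message, **kv):
    print(format_event(message, **kv))
```

With this shape, turning "count of events where `status == 500`" into a real-time metric is a field lookup, not a regular expression.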
Use the most cost-efficient solution for each purpose:
- stats for metrics
- tracing for debugging/sampling
- logs for incident triage & debugging. Do not log anything when everything is fine.
Some notes on ESS scaling: the original ESS cluster had 10 r3.8x data nodes and
3 m4.l masters, handling 10K JSON docs on average. However, managed ESS did not scale well,
causing support issues, and the optimal solution requires specialized
clusters/node types for different types of data: hot/warm/cold nodes for ingest,
rollover, and forcemerge.
Distributed Tracing at Uber Scale—Yuri Shkuro
Yuri told the story of rolling out tracing at a large company. How large? Uber
has between two and three thousand microservices. "At some point we had more
services than engineers, so we had to hire our way out".
Adding tracing throughout the system is hard, expensive, and boring.
Tracing is both a newer and a harder problem than metrics/logging, specifically
because to be useful the tracing context must be propagated through every microservice
along the way, from the input to the output of each. Yuri called this the
in-process context propagation problem. Thread-local storage can be a solution for
some languages, but Golang has nothing implicit that can be used, so languages
like it need explicit context passing. OpenTracing to the rescue (the other option is
vendors that monkey-patch APIs to add "transparent" support, which is expensive
for vendors and therefore expensive for customers).
How do we convince the organization to instrument their microservices? It’s the
2017 travelling salesman problem, and like the original it is NP-hard.
The benefits of tracing: transaction monitoring, performance/latency optimization,
root cause analysis, and service dependency analysis. For Uber, service dependency
analysis is the most important use case: who the dependencies are and what the
workflow is, attributing traffic coming from a service's consumers to actual business use cases.
Another feature not directly related to tracing but very useful: propagation of context,
also called “baggage”. Using baggage, shared services can work in a multi-tenancy
model (test/prod/etc.), carry http headers, auth tokens, do request specific routing.
A good “baggage” system is designed in a way that services do not need to be aware
of its structure.
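A minimal sketch of header-based baggage propagation, assuming the `uberctx-` header prefix that Jaeger uses for baggage items (the function names are mine, not Jaeger's client API):

```python
# Services stay unaware of the baggage's structure: they just forward
# any headers carrying the baggage prefix to their downstream calls.
BAGGAGE_PREFIX = "uberctx-"

def inject_baggage(baggage, headers):
    """Copy baggage items into outgoing request headers."""
    for key, value in baggage.items():
        headers[BAGGAGE_PREFIX + key] = value
    return headers

def extract_baggage(headers):
    """Recover baggage items from incoming request headers."""
    return {k[len(BAGGAGE_PREFIX):]: v
            for k, v in headers.items() if k.startswith(BAGGAGE_PREFIX)}
```

Because the baggage rides on ordinary headers, a shared service can route a test-tenant request differently without knowing anything about the tenancy model itself.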
To sum up, tracing allows:
- performance tests
- capacity/cost accounting and attribution to business goals
- baggage propagation
All of these are “carrots”, and eventually (with adoption, and because a non-compliant
service breaks the tracing chain) these become “sticks” forcing stragglers to catch up.
Adoption is measured not just by receiving some tracing metrics from a service,
but also by accounting for correct propagation (for example: there is a caller but no spans
are emitted by the service; there are no routing logs; etc.).
The specific tracing implementation is Jaeger,
developed and open-sourced by Uber. It is Zipkin-compatible.
How many people use tracing? @opentracing ran a survey and found 42% of people/companies
do, but the sample is biased toward people who are already interested… A quick show of hands
suggested approximately 10-15% of attendees at Monitorama.
Kubernetes-defined Monitoring—Gianluca Borello
In addition to being a comprehensive demonstration of SysDig's features, the
talk by Gianluca raised a number of interesting points about the role of an SRE team,
particularly a constrained one. It also covered the features
a monitoring tool should have to be effective in a container environment, and
monitoring-as-code in Kubernetes metadata.
When an SRE team is constrained (and in general, for optimal SRE scaling):
- the team does not do instrumentation. Monitoring tools should do this.
- the team does not help in producing metrics, unless they are deeply custom. Either the monitoring tool does this, or the service owners should be able to easily add their own.
- all metrics must be tagged without SRE involvement, ideally by using some metadata that is interpreted by metrics capturing tooling.
- there should be an easy way to direct metrics, alerts, and dashboards to the relevant teams. Metadata helps here as well.
The same properties that make containers great also make them more opaque. A container-aware
monitoring solution can help by reaching into the container from the host (the
other option is explicitly configured monitoring sidecars) and by tagging/grouping data
automatically, both by container and by higher-order primitives such as Kubernetes pods.
SysDig used the free-form metadata.annotations field of a Kubernetes job to specify
monitoring as code: team members, dashboards, and alert specifications.
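SysDig's actual annotation schema was not captured in my notes; a hypothetical sketch of what monitoring-as-code in metadata.annotations could look like (the annotation key and all fields inside it are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
  annotations:
    # Illustrative key and schema, not SysDig's actual format:
    monitoring.example.com/config: |
      team: payments
      dashboards: [latency, errors]
      alerts:
        - metric: http_5xx_rate
          threshold: 0.01
          notify: payments-oncall
```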
Some metrics of the event:
- 614 attendees, 582 on waiting list this year.
- 1146 people on the conference Slack.
- 50% women speakers and aiming for 75% next year.
- Up to 800 clients on the conference WiFi.
- 5,000 live stream viewers.
- 1 underground fire in downtown Portland.
Other Summaries and Reports