Large Installation System Administration (LISA) 2019 conference notes

“Moving everything to the cloud” is currently in fashion. This trend makes it easy to superficially dismiss LISA as a remnant of the dinosaur datacenter age that should only be interesting to a handful of SREs working for the major cloud providers. But the attendance and the organizers’ selection of topics show that the conference remains lively and, at three days long, has plenty of time for in-depth learning and discussion.

Overall, LISA presents an interesting mix of subjects. The focus on operating datacenters was particularly felt in the vendor area: novel high-density servers, security, networking, backups, and plenty of competing converged storage solutions aimed squarely at that market. The program, however, was light on hardware and heavy on SRE practices and DevOps Days-style culture topics (team alignment, communication, incident management, and so on). There were meaty learning workshops (performance investigations, Linux internals, useful tools), plenty of Kubernetes material, and even a small sprinkling of serverless. In hallway conversations and birds-of-a-feather sessions there were plenty of sysadmins and datacenter managers, along with cloud-provider and large-company folks, university people, and students. I felt there were fewer startup- and small-company-related discussions than I’m used to at other DevOps-focused events.

If you don’t have much time to look at everything, here are my top three topics from LISA:

  1. eBPF is here and it is a brand new class of software. BPF permits inspection of the internals of the Linux kernel and can dive deep into the interactions between a particular user-mode program and the operating system. Performance is just one area of BPF’s potential application; security, non-invasive monitoring, and other uses are certain to emerge. I had a taste of the power of in-kernel, JIT-compiled code back in 2014-15 with SystemTap, but that setup felt very brittle; BPF feels more stable and is supported on a wider range of distributions. Configuration management and monitoring of BPF programs is nearly non-existent at this time, even though a few companies already run tens of BPF programs on every production machine. It is worth watching this space closely.

  2. The ratio of systems to engineers keeps going up, so it is inevitable that automation will take on more and more operational duties and move “up the stack”, orchestrating and coordinating distributed services instead of individual machines. However, do we really know how to combine active automation systems with humans trying to respond to an incident when things inevitably go sideways? This is the subject of Revisiting “Automate All the Things”.

  3. The notion that a microservices architecture lets self-sufficient teams consume platform services “over the wall”, with minimal coordination with other teams, is dead. A microservices architecture in fact requires more coordination than ever. It is true that engineers can deploy code faster and fix issues sooner, but the cost is the complex and evolving infrastructure needed to host a flexible network of services (with dedicated teams operating it), as well as more coordination at team and service boundaries. When no team works alone, communication and team alignment are more important than ever. Fuzzy Lines: Aligning Teams is a good starting point for learning how to communicate better with everyone you depend on and everyone who depends on you.

The rest of the notes are grouped by topic. In a three-track event, I could only cover a fraction of the content, and the selection reflects my interests.

Teamwork and People

Incident Management

Operations

Monitoring

Platforms

Security: In Search of Security Shangri-la

All of the conference materials have now been posted: individual talk links on the program page lead to slides and recordings, where available. I looked for other in-depth reports on this LISA and found only one, by Chastity Blackwell. Take a look at it as well for a different take on some of the same topics and a different selection of talks.

Teamwork and People

Fuzzy Lines: Aligning Teams to Monitor Your Application Ecosystem—Kim Schlesinger and Sarah Zelechoski

Kim and Sarah’s spoke about collaboration between dev and ops teams, but the lessons generalize easily to any two (or more) teams collaborating. For example, compare the suggestions below to the actionable steps of Rich Smith’s security team talk, and find similarities.

A good chunk of the talk was devoted to building a shared team culture, which includes:

A team narrative:

  • history: where do we come from, why do we need change
  • vision: what will we work on, what is the better future
  • shared understanding: everyone should be able to tell the narrative

Shared team values: how we work/treat each other, not skills/capabilities. Keep it really short to be memorable (an example: “roll with rick” - respect, inclusion, compassion, kindness). Require commitment and participation. Are these values part of the formal evaluation and compensation process?

Reinforcement: have consistent and well-known short prompts to reinforce culture, both positive and corrective. Fairwinds uses a “bee” for exemplifying company culture. The corrective counterpart is “we don’t do that here”, intended to be non-confrontational and to start a smaller conversation.

Collaboration between teams is facilitated by shared tools, relationship building, and known processes for the teams to interact. Guiding principle: “teams are thoughtful of and accountable to each other for their actions”.

Processes: define team rules; how are the responsibilities shared? Where the lines are fuzzy, help each other (we’re both responsible vs. nobody is responsible)

  • weekly syncs between teams
  • share impact and value of work being done

Tools: define the desired outcomes first, then choose the tools.

  • use a shared monitoring platform; they use Datadog.
  • manage monitors as code; they built their own synchronization tool (YAML to Terraform to Datadog).

Distributed Sys Teams—Sri Ray

Sri offered a collection of tips for building distributed teams (“remote” has a slightly negative connotation). A lot of the tips boiled down to treating remote and local employees equally:

  • choosing tools that allow both real-time and asynchronous collaboration
  • moving important conversations to asynchronous media (email)
  • respecting time preferences and timezone differences
  • scheduling meetings at a time most convenient for all, perhaps resulting in a time less convenient for the US
  • equal distribution of swag (stickers, shirts, and so on)
  • having remote co-workers report to the same managers as their local teammates

A number of suggestions addressed socializing and building non-work-based relationships:

  • telling remote folks about what’s happening (what’s for lunch etc)
  • having group gatherings for fun and to learn about each other
  • understanding that not everyone is equally willing to mix work and personal life
  • allowing remote team members to participate in some fashion in centralized activities (e.g. send popcorn and movie passes for movie nights)
  • having physical get-togethers: “air travel is surprisingly inexpensive these days for a large company”

Some accommodation for asynchronous nature of work and more varied culture might be necessary:

  • over-communicating intent
  • people from some cultures are less likely to voice their opinions; hierarchies are more real/unquestioned in some cultures than in others
  • sarcasm might not be well understood
  • feedback loops get slower; code reviews may take longer
  • translation overhead: sometimes, text might be preferable to talking because it avoids live translation effort.

One interesting question that was left unanswered was the best compensation strategy in a global market (there are multiple possible approaches, and Sri didn’t offer a preference either way).

Running Excellent Retrospectives: Talking with People—Courtney Eckhardt

“English is a pretty blamey language”

This was a workshop on how to run a retro, how to create a conversation, and how to run a good meeting in general.

A retrospective runner should:

  • facilitate conversation
  • work towards a productive meeting
  • not screw up by making bad jokes

Some common words can be problematic. “You” creates an oppositional conversation. “Why” is a request for justification. Instead, try to evoke thoughtfulness and discuss complexity. Ask: how, what/what if, could we, what do you think, what would you have wanted to know.

“Human error is not a root cause” - Allspaw: human error is where you should start your investigation. Miller’s law: “to understand what someone is saying, assume it is true and try to imagine what it could be true of.”

Storytelling for Engineers—Bradley Shively

Bradley spoke, primarily, about writing better email. Engineering communication is improved by context, analysis, and insights, because there isn’t a lot of shared knowledge between people. Like any good writing, engineering communication progresses through identifying the audience, figuring out the story, and telling the story with context.

An improved email template:

  • TL;DR and result
  • if you care about X, continue
  • X is now changed in aspect Y due to Z (image)
  • details for the super interested…

The talk was exclusively about long-form communication (email), while more and more workplaces are actively moving day-to-day technical communication away from email and into Slack (and its competitors), which have different patterns of communication. It would have been interesting to see more discussion of that.

Incident Management

When /bin/sh Attacks: Revisiting “Automate All the Things”—J. Paul Reed

The first half of the talk recapped Allspaw’s thesis (conclusions based on a thorough investigation of an incident at Etsy in 2014) and was a shorter version of the presenter’s talk at Monitorama 2017 (the link leads to my notes). The rest of the talk discussed the interaction between automation and incident management, and to me that discussion felt timely because the growing scale of our systems means automation is growing steadily more prominent.

Bainbridge’s “Ironies of Automation”:

  • Manual skills degrade when not used (autopilot and flying airplanes by hand)
  • However, a working, up-to-date knowledge of the system is required to handle incidents
  • Generally, automation means a speed vs. correctness tradeoff - we can never validate correctness of automation in real time
  • Automation can hide system state
  • Tracing the decision for automation is difficult or impossible (AI, ML, black box automation providers), which leads to current context being indecipherable when paged.

Automation does not possess directed attention, redirectability/flexible focus, or predictability. All of these qualities are used by humans to coordinate actions. Connectivity != coordination. Is automation a deterministic machine, or an animate agent capable of independent action? Automation is just beginning to participate in our cognitive systems (see the aviation and medical fields; they are further along in that space).

Chaos engineering helps us explore the discretionary space and “buy time”:

  • conducting chaos/DR exercises
  • expanding our understanding of the system as daily practice
  • evolving automation together with the product; automation is a product, separate from and related to the system it supports.

Antipatterns:

  • review only failures, not successes
  • biases (hindsight bias/counterfactuals/correspondence bias/…)
  • deprioritizing retro/learning processes. You must create time and space for expertise. Lessons from firefighting show that if a retrospective is not conducted within 72 hours, memory is lost and the context reconstructed afterwards will not be the original context.

What Connections Can Teach Us about Postmortems—Chastity Blackwell

Connections is a 1970s TV series digging into the history of some key technologies of the modern world. The whole series is on archive.org. Its key distinction is representing history as a series of unlikely connections instead of “history developing in a straight line”.

Chastity’s talk was about postmortems and how they are similar to Connections because incidents are generally non-linear as well. Postmortems need to balance accessibility of results and complexity of systems being discussed. If there’s no balance, everyone skims postmortems to action items, and they are never read again—write-only memory.

For these documents to be read, we must make deliberate choices about what to omit. The goal of the story is to remember what is important, the connections, not the details. Writing well is a skill engineers need to cultivate, and few people do. Using informal language and avoiding jargon, memes, and gifs helps.

Avoid trying to write the story until after the investigation is complete. This is hard because our brains tend to build narratives automatically. One way to do this is to start with a timeline of events, without trying to assign any reasons.

To sum up, highlight surprising connections. Show key systems and people. Use the perspective of people who were involved. Provide context to show the complexity of the incident, and write in a way that is easy to read.

Earthquakes, Forest Fires, and Your Next Production Incident—Alex Hidalgo

Alex covered the history of the Incident Command System (ICS), currently used nationwide in the US, and the application of ICS to managing operational incidents in technology. ICS solves common incident response problems: lack of insight; bad communications; no clear hierarchy; freelancing (engineers trying to fix issues by themselves without any coordination with the larger incident management effort).

An incident needs exactly one Incident Commander (IC); it is the only role that really must exist.

Other potential roles:

  • Operations lead: the person in charge of making production changes; nobody else should do so. Could be the person who first got the page. Actions should be documented for future review. The operations role can be further split into dbops, netops, sysops as needed
  • Communications lead: the only person updating things like the public status page
  • Someone who keeps the records (also known as scribe)
  • Planning lead: support and rotations.

Dissemination of information by text is preferred over voice; text is a widely distributed channel, so any stakeholder can easily join.

If you are wondering if you are in an incident, start the response. Don’t be afraid to declare an incident. (PagerDuty referred to this principle as “Don’t hesitate to escalate”.) Have a culture that acknowledges that stuff breaks–it’s just how it is.

Operations

BPF Performance Tools—Brendan Gregg

This workshop was a series of labs where we had to find out as much as possible about a misbehaving executable using BPF-based performance investigation tools. Modern BPF is a programmable kernel: programs are verified, JIT-compiled, and run inside the kernel, effectively patching it at runtime. The verifier rejects code that is not safe, so a vendor would have a much easier time convincing Netflix to run a custom BPF program than a custom kernel module. BPF tools are also “production safe”: they do not stop the process and do not trigger the health check failures that result in traffic being moved away from the machine being investigated.

BPF is a new application model: kernel-mode applications via the BPF API and helper calls. Facebook runs 40 BPF programs by default on every machine; Netflix has 14. How do we inspect, profile, and debug these programs, which are a whole new class of software?
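
One partial answer to the inspection question is bpftool, which ships with the kernel sources. A quick sketch (the program id below is made up, and availability depends on your distribution):

    # List BPF programs currently loaded into the kernel
    bpftool prog list
    # Dump the translated instructions of one program by its id
    bpftool prog dump xlated id 42
    # List BPF maps, the data structures shared between BPF programs and user space
    bpftool map list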

Methodologies: “Eliminating unnecessary work leads to high performance gains”. A lot of the time, simply understanding what is running leads to a solution: many issues are “dumb load” rather than complex algorithmic problems – things that should not be running at all, or that unintentionally do unnecessary work that kills performance.

Lab 1

Reads queue behind writes, so write latency can flood the disk and manifest as read latency issues. Generally, Brendan investigates syscalls when the time spent in syscalls is high.

Dealing with context switches is complicated; it is easier to trace at the application level.

BPF overhead is at least 65ns per event. For very frequent events, like CPU scheduling, this adds up. runqlen samples rather than traces, so its overhead is predictable and minimal; it samples 99 times per second, not 100, to avoid lockstep sampling.

In the kernel, tracepoints don’t (generally) change. Hooking internal kernel functions (kprobes) requires regular maintenance, as those functions are not guaranteed to stay the same. Experiment with the exposed hook points; for example, you can attach to all k:blk_* probes and figure out which ones are of interest.
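
A bpftrace one-liner along these lines (a sketch, not taken from the workshop materials) counts hits across all block-layer kprobes, showing which ones are worth tracing in more depth:

    # Attach to every kernel function starting with blk_ and count calls by probe name
    bpftrace -e 'kprobe:blk_* { @[probe] = count(); }'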

A performance investigation should result in actionable items, not reams of data from BPF. “buy fast disks” is an actionable item.

  • syscount - system call count.
  • setting ext4slower threshold to 0 shows all calls. In this case we could see large writes followed by a very slow fsync.
  • tcplife is quite popular, with low overhead because it’s only tracking kernel socket changes (state changes to established or closed).
  • Netflix flowsrus (not public yet): run tcplife everywhere, dump to S3, correlate and build directed graphs of connections.
  • runqlat: primary metric, proportional to pain. runqlen: secondary metric. Workload characterization: is it a queue length problem? (A usage sketch for a few of these tools follows this list.)
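
These tools come from the bcc collection; a rough sketch of typical invocations (installed names and paths vary by distribution, and some packages add a -bpfcc suffix):

    # Trace ext4 operations slower than the given threshold in ms (0 = show everything)
    ext4slower 0
    # Summarize TCP sessions (process, addresses, duration, bytes) as they close
    tcplife
    # Print a histogram of run queue (scheduler) latency every 10 seconds
    runqlat 10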

Lab 2

To analyze performance, you need to be able to create workloads reliably; if a workload is caused by a manual operation, stick it in a loop.
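
For example (a trivial sketch; ./slow-operation stands in for whatever manual step reproduces the problem):

    # Re-run the problematic operation continuously so tracing tools have something to observe
    while true; do ./slow-operation; sleep 1; done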

  • execsnoop - a lot of stuff runs, but which one is slow?
  • statsnoop - specific files. CPU flamegraph.
  • exitsnoop

Lab 3

The system is idle, but exhibits a run queue issue? It was only on one CPU, with two threads. cgroups and containers will do things like bind tasks to specific CPUs, so you need to be comfortable debugging these issues.

bpftrace syntax is fairly similar to awk. Brendan said that it can probably be learned in one day.
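
The awk resemblance shows in its probe { action } structure; a commonly cited one-liner (a sketch, not from the workshop):

    # Count syscalls by process name; the map is printed when you hit Ctrl-C
    bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'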

Jupyter Notebooks for Ops—Derek Arnold

Jupyter is a document format, a communication protocol, and a runtime: JSON, WebSockets+ZeroMQ, and IPython. Jupyter can run 100+ languages, and it is possible to add new ones (“Jupyter kernels”). Derek showed us how to set up Jupyter (he recommends using conda and virtualenvs) and touched on some of the tools in the ecosystem: a multi-user notebook server, the same thing at a tiny scale, and JupyterLab, an evolution of the Notebook interface that is closer to an IDE.
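
A minimal setup sketch in the spirit of those recommendations (the environment name is made up, and versions will vary):

    # Create an isolated environment so notebook dependencies don't pollute the system
    conda create -n ops-notebooks python=3.8
    conda activate ops-notebooks
    # JupyterLab for the IDE-like interface, requests for poking at REST APIs
    pip install jupyterlab requests
    jupyter lab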

The promise of using Jupyter was runbooks with better documentation, and an easy way to play with data that is retrievable with REST APIs (requests, charts, etc.) Unfortunately, there was not a single demo in the talk to demonstrate these concepts in practice.

Linux Systems Performance—Brendan Gregg

A full house for this talk. Brendan walked us through a lot of Linux performance tools (classic and modern), based on some examples.

Tools:

  • mpstat. At 0-50% CPU, processes are not competing for hyperthreads. Loaded systems can also have contention on the CPU cache.
  • pmcarch: Intel-specific, summarizes CPU performance counters. How efficiently is the CPU running? LLC is the last-level cache hit ratio; when running on a saturated CPU, there is more competition.
  • perf, to see context switches (in-kernel mode)
  • cpudist, how much time did the process spend on a CPU (BPF-based)
  • uptime, for a second or two. Is there a CPU issue?
  • top/htop: Brendan doesn’t like the use of colors for subjective or context-dependent metrics. Reserve color for pass/fail indications.
  • vmstat, iostat. The first line of output shows averages since boot, but not for all metrics.
  • free now has an “available” column, because the previous output was confusing.
  • strace - very slow; will be replaced by “perf trace” (BPF instead of ptrace)
  • tcpdump: high overhead. Use BPF in-kernel summaries instead; capturing everything can still have significant overhead.
  • nstat, the new netstat (from iproute2)
  • slabtop, kernel memory
  • pcstat, page cache residency by file; database perf analysis
  • docker stats, cgroups by container
  • showboost, CPU clock rate (PMC/MSR based)

Some Netflix-specific performance troubleshooting notes. At Netflix, most analysis is done through user interfaces; SSH is considered a last resort. But command line tools often have better documentation and more options, not all of which are exposed by GUI frontends. All performance tools should be installed everywhere out of the box; in crisis mode, it can be difficult to build or install them. strace and other large-performance-penalty tools are not useful at Netflix because the added load/latency will fail health checks and trigger failover away from the node under investigation.

Methodologies:

  • Linux performance analysis in 60,000 milliseconds (a command-by-command sketch follows this list)
  • For every resource on the system, check USE: utilization; saturation; errors
  • Learn the characteristics of your workload, not performance
  • Who - pid, programs, users? Why - code locations, context? What? How is it changing?
  • off-CPU analysis: look at blocked time instead of CPU running time.
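
The 60,000-millisecond checklist in the first bullet is, roughly, the following sequence of standard commands (reproduced from memory of the Netflix write-up, so treat it as a sketch):

    uptime              # load averages: is CPU demand rising or falling?
    dmesg | tail        # any recent kernel errors (OOM kills, dropped packets)?
    vmstat 1            # system-wide: run queue, memory, swap, CPU time
    mpstat -P ALL 1     # per-CPU balance: is one hot CPU the problem?
    pidstat 1           # per-process CPU usage over time
    iostat -xz 1        # disk I/O: utilization, queue sizes, latency
    free -m             # memory usage, including buffers/cache
    sar -n DEV 1        # network interface throughput
    sar -n TCP,ETCP 1   # TCP connection rates and retransmits
    top                 # final overview check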

Profiling: flamegraphs are useful to summarize perf results. “If you aren’t doing it yet, start doing CPU flamegraphs.” In Linux 4.9+, summarization of flamegraphs can be done in kernel.
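
The usual recipe uses perf plus the scripts from Brendan’s FlameGraph repository (a sketch; the script paths assume you have cloned that repository locally):

    # Sample all CPUs at 99 Hz with stack traces for 30 seconds
    perf record -F 99 -a -g -- sleep 30
    # Fold the stacks and render an interactive SVG
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu.svg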

Tracing is an advanced topic. ftrace powers funccount (kernel function call statistics); perf can use tracepoints in the kernel. An example BPF tool, ext4slower, catches slow filesystem operations better than disk I/O metrics do. bpftrace is an interesting way to write one-liner tracing programs.
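
For example (a sketch; funccount exists in both the ftrace-based perf-tools collection and bcc):

    # Count kernel function calls matching a wildcard; Ctrl-C prints the summary
    funccount 'vfs_*'
    # List the stable tracepoints that perf (and BPF tools) can attach to
    perf list tracepoint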

Brendan doesn’t like benchmarks. A lot of them are wrong and it takes a lot of effort to refute them. Active benchmarking: do performance analysis while the benchmark is running, and ask “why isn’t it 10x faster?”

Resource Management and Service Sandboxing with systemd—Michal Sekletar

I only captured about a half of this workshop, leaving for another talk afterwards.

systemd is an init process, a service manager, and a basic userspace for Linux. It has numerous components, and the list is constantly growing.

The basic principle of systemd-based init: a dependency-based (relational and ordering) execution engine. All managed objects are “units”, and there are lots of unit types besides services. Unit specifications have a location-based hierarchy; only the highest-priority file is loaded (they are not merged). Reconfiguration is explicit (systemctl daemon-reload). A unit is not the same thing as the unit file. System V init didn’t track state and didn’t have a general concept of units.

systemd-analyze is useful for understanding the graph. systemd --test creates a boot transaction and prints it. systemd-analyze verify can show exactly how dependency loops will be broken.
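
A few of the inspection commands, as a sketch (the unit name is a placeholder):

    # Time spent by each unit during boot, and the critical chain of dependencies
    systemd-analyze blame
    systemd-analyze critical-chain
    # Check a unit file for errors, including dependency problems
    systemd-analyze verify myapp.service
    # Show what a unit pulls in
    systemctl list-dependencies myapp.service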

Relational dependencies: Wants=, Requires=, PartOf=, and so on. Relational dependencies do not impose any order; use After=/Before= for that.

There is transaction logic: when there is work to be done (starting a unit, for example), tasks (jobs) are created, optimized, and placed into a job queue.

Network Fault Finding System: Packet Loss Triangulation—Jose Leitao and Daniel Rodriguez

I thought this was about finding network issues on the Internet, but it was in fact about fault finding within a Facebook datacenter, which was not as useful to me. One takeaway was that virtualized environments don’t handle ICMP as well, since on physical machines it is usually done in hardware.

Monitoring

Wide Event Analytics—Igor Wiedler

Igor raised some familiar problems with monitoring complex systems using logs, metrics, and traces separately from each other. His solution is to emit generic structured events and to store and query them at scale (I first heard about this approach at Monitorama ‘17 from Mukerjee and Majors). The reasons, benefits, and difficulties (particularly around the cost of storage and the speed of search) are unchanged from previous discussions.

The novel part of the talk for me was the detail about the technology for storing and querying vast amounts of data at interactive latencies. The technology is columnar storage, already used by analytical databases. The interactive usage requirement gives us a latency budget (10 seconds), which determines the maximum amount of data that can be processed per query by a single node (10 Gb per disk; essentially, however much one disk can stream within the budget).

A columnar DB greatly reduces the amount of data to scan, because a typical query looks at very few fields of a wide event. More than a billion events can be scanned within the latency budget. Compression is very effective for many columns, and time-based partitioning of data also helps, as most queries are scoped by time.

Not many tools exist for this kind of observability work; Igor said we need more tools in the monitoring space. See the Dremel and Scuba papers, from Google and Facebook respectively.

Off the Beaten Path: Moving Observability Focus from Your Service to Your Customer—Mohit Suley

“Don’t ask [your users] to convert feelings into a number. It’s not the amount of spice in your tikka masala.”

Mohit proposed expanding the scope of observability (which is a superset of monitoring) towards the customer, beyond visibility into just the application stack. Application metrics have become our comfort zone: the number of nines, the 75th/95th percentiles. But these metrics can be blind to users who fail to connect at all. They don’t see users who can’t use your service because it requires fast internet, or non-customers who are still in the market for a solution to their specific needs.

Network error logging is a proposed spec to collect information about connectivity issues.

To observe and debug usability issues from the viewpoint of the user, Microsoft’s Clarity offers anonymized session replays, structured data, and behavioral analytics.

Gathering feedback: ask people how they feel. Perform sentiment extraction (Stanford CoreNLP) from freeform feedback. Have customers meet engineers. Talking to customers helps you see the things you are doing really well.

Mohit’s 5-point plan to improving observability:

  • listening tours
  • insider programs
  • sentiment feedback
  • product usability
  • service metrics

Platforms

The Container Operator’s Manual—Alice Goldfuss

It was not really a surprise that the first keynote of the conference was about Kubernetes. However, this year the air surrounding this tool feels a bit different. As the community gathers and shares experience with K8s, the hype is pushed out by a known set of benefits and drawbacks. Widespread adoption means that moving to Kubernetes in production is looking less like poking around a dark basement with a dying flashlight and more like following a fairly well paved road to assemble a set of interconnecting components.

A key drawback of K8s is complexity. “Containers need headcount”: operating a containerized cluster is a completely new venture and skill set, and containers should not be attempted as an add-on to an existing infrastructure team’s duties. The distinct areas of knowledge involved (operations, deployment, tooling, monitoring, kernel, networking, and infosec) require a brand new team of 6-8 people (at the very least 4, because they have to be on call). And the new team needs to be empowered and advocated for.

Complexity also means that a big bang, greenfield replacement is hard. Most likely, a company will end up with a hybrid infrastructure (some containerized workloads and some existing services), and it will take about a year to roll out the platform with a single leading service.

Containers themselves are simple: UNIX processes, running from tarballs, bound to namespaces and controlled by cgroups (a rough command-line sketch of this follows the list below). The complexity comes from having to build out a set of services taking care of:

  • building the containers
  • scheduling tasks and allocating resources
  • managing networking / connectivity in a dynamic environment (5-6 different projects to do networking in K8s, and some of them use each other)
  • deploying new versions
  • monitoring
  • provisioning, presumably of physical hardware (Max: the LISA audience’s focus is showing here)
  • debugging
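
To ground the “containers are simple” point above: something very container-like can be assembled by hand from stock Linux tools (a rough sketch; image.tar and ./my-service are placeholders, and real runtimes do a great deal more):

    # Unpack an image tarball to use as the container's root filesystem
    mkdir rootfs && tar -xf image.tar -C rootfs
    # Run a shell in new PID, mount, UTS, and network namespaces, chrooted into the rootfs
    sudo unshare --pid --fork --mount --uts --net chroot rootfs /bin/sh
    # Resource control comes from cgroups; systemd-run can apply limits to a command
    sudo systemd-run --scope -p MemoryMax=256M -p CPUQuota=50% ./my-service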

Alice mentioned the strengths of containers (stateless applications, portability, unified deployment strategies) and some of the weaknesses (handling stateful services). For stateful containers, all architectures are network-bound: even if storage is local, migrating data to another host (for redundancy, or when a failure has already occurred) goes over the network. All three major cloud providers have non-Kubernetes solutions for databases that will work for most people. Vitess works on K8s if the managed offerings are not good enough.

It’s possible to gain a lot of the benefits of containerization (experimenting, versioning, iteration) with a cloud vendor and multiple regions, and likely more cheaply. Having only a few services, or a small team, or databases with nicknames are strong indicators that you should not do Kubernetes.

Deep Dive into Kubernetes Internals for Builders and Operators—Jérôme Petazzoni

“The easiest way to install Kubernetes is to have someone else do it for you”

Jérôme shared how he practiced before taking the Certified Kubernetes Administrator exam. The talk was very hands-on; I suggest watching the recording. Kubernetes components were brought up one by one: starting from a minimal set, Jérôme would try things that fail, then bring up the component of the Kube stack that makes the broken piece work (starting with running some pods, then networking, and so on).

K8s is, at its core, a declarative system. A (syntax-checked) manifest is persisted to etcd; then controllers wake up and do “stuff” in response. All interactions go through the API server, which writes to etcd; the API server is the only component that can access etcd.
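
The declarative loop is visible from the command line (a sketch; the manifest file and deployment name are made up):

    # Declare desired state; the API server validates it and persists it to etcd
    kubectl apply -f nginx-deployment.yaml
    # Controllers notice the new object and create pods to match the spec;
    # the status fields are filled in asynchronously
    kubectl get deployment nginx -o yaml
    # Watch controllers and kubelets doing "stuff" in response
    kubectl get events --watch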

A lot of operational tasks need to be properly automated to run K8s at any kind of scale: CIDR allocation, routing, and so on. CNI helps with configuring networking; it is a set of plugins invoked by the kubelet whenever it needs to do anything network-related, and the exact implementation can be left to the plugin.

Ops on the Edge of Democracy—Chris Alfano and Jules Schaumburg

“On the ground, things move a lot slower than in the cloud”

A contemporary tech company optimizes its software for fast, global scaling on top of fairly complex infrastructure. Chris argued that this pattern does not work well for non-profits and local governments. Local organizations want local customization, not a single product that mostly works for anyone. They prioritize local involvement and participation, not scale; and they have neither the technical expertise nor the desire to operate a complex infrastructure that is accidental to doing something the organization actually wants done.

There was a lot of overlap between ideas in the talk and Giving Done Right, a book I read recently, with respect to local differences, listening to needs instead of imposing solutions, starting small, balancing addressing the symptoms and addressing the causes, and so on. Being humble and listening would summarize it well.

Chris said a new toolkit/platform is needed to serve these use cases. Replicating a whole solution should be a one-click operation. Idling must be free. There should be no non-free dependencies, and the infrastructure should be owned by the community.

Jules spoke about the cost of friction and misalignment between systems used for public service, and the extreme cost and time requirements to get anything done (24-36 months for an RFP).

I don’t recall hearing any solutions being proposed. The talk felt like a wish list, and there was little specific critique (for example, “Wordpress is not good” was stated without any explanation of what was wrong with it). Some suggestions contradicted earlier arguments (the community should own the infrastructure, but the community doesn’t want to run infrastructure; it wants to solve its problems). The key question of how to pay for it all, and who will maintain the non-proprietary alternatives to proprietary dependencies, also went unanswered.

There was a very relevant audience question/story about a migration at a government organization. The project manager had left, and the lead engineer was required to use all assigned staff regardless of their ability, burned out, and will never work in public service again. The question was how to address this dysfunction, and the answer spoke about the importance of the things we are doing. Joining a dedicated team (“digital service” teams) could be one way to avoid taking on the bureaucracy alone.

Security

In Search of Security Shangri-la—Rich Smith

Rich argued for changing the culture and approach of security teams, because the current approach is not working out. “Security industry generates FUD [fear, uncertainty and doubt] in order to sell hope”, but hope is not a strategy. An emphasis on fear, hoodies, and padlock icons does not serve end users well and does not make the security team a good partner. Mainstream media coverage adds to the image problem by often taking the “technology was good, humans were at fault” angle, which DevOps has thoroughly discredited as unhelpful, but that change has not yet propagated widely to the security field.

The change would involve a more open, collaborative approach. At Black Hat, Nicole Forsgren described the DevOps/security relationship as a marriage, and Rich really liked that idea. Getting security to work more closely with other teams aligns responsibility and accountability. It also keeps the security team at the table: “if you can’t be worked with, people will work around you”.

The change would also involve a different interpretation of security (this part of the talk reminded me of Jeff Man’s “Rethinking Security” talk earlier this year, linked below). Security is not binary; it is context-dependent. It is not a destination; it is a journey. It is also not static. A zero-risk security solution is not possible; instead, mindful acceptance of residual risk is at the core of security. As an example, measure “time to the first phish report” instead of “what percentage of users clicked on a phish”. The latter is, game-theory-wise, a bad winning condition (one click is enough to lose).

A few actionable steps:

  1. Measure security teams by what they enable, not by what they block.
  2. Socialize. Rich’s security team had a $350/month order of candy, money well spent for interacting with other engineers and teams.
  3. A team that is open about what it does and why, spreads understanding.
  4. Blameless culture: learn and improve from what happened, instead of blaming. True causes of incidents will not be revealed in a blaming culture.

All of the above steps are helped by not hiring assholes onto the team. If you inherit one, get rid of them.

Lightning talks

Some highlights from talks where I didn’t have enough material for detailed notes:

  • First things to worry about, operationally, when you don’t have SREs yet: SLI/SLOs; incident response rules; postmortems and learning (How to SRE without an SRE on Your Team, Amiya Adwitiya)
  • A wonderfully hilarious talk on SRE principles applied to cat ownership (SRE for Cats, Denise Yu)
  • Good qualities for a text mode control/shell: interactive, hierarchical context, intended for quick operations – not necessarily readability (An Overview of Network Shells, Michael Smith)
  • A 5-second-transition Ignite style talk (March Madness!, Alex Hidalgo)

30 Interviews Later…—Paige Bernier

A story of some things companies do wrong when interviewing, and a bunch of things that a particular company (Lightstep) did right in their process.

  • If there is no video call, is the company really remote friendly?
  • In-interview feedback is good; being upfront about not needing to know certain specific tech details is good.
  • Paige likes pairing with an engineer for a more collaborative-style interview.
  • It helps to have a clear schedule for onsites (interview titles, names and roles) in advance.
  • Two participants from the company in each interview: an interviewer and a shadow (somebody training to interview), which also helps eliminate bias.
  • A nice small touch: a handwritten thank you card.
  • When discussing offers: provide leveling info + career / engineering ladder, be flexible with negotiation times.