DevOps Days Tel-Aviv 2017 conference notes
The Tel-Aviv instance of DevOps Days was interesting to me for two reasons. The first is that there’s a ton of innovation happening in Israel, so there’s a chance of hearing new things; yet Israel is far enough from Silicon Valley that its thinking and opinions are distinct from the San Francisco area monoculture.
The second reason is related: flight tickets within Europe and, to some extent, from Europe to Israel are fairly cheap but transatlantic flights are much more expensive, so there is a “moat” that many speakers and attendees will not cross.
The event was quite busy, with half the time spent in a 4-track format (3 talks and a full-day workshop) and the rest combined for keynotes, Open Spaces, etc. There was a great follow-up MeetUp on Day 2 with content quality as good as the conference itself, and shorter food lines :-)
A shared theme of the talks I attended was a focus on people and processes instead of technology. Several talks covered managing and scaling an SRE/Operations team within an organization, as well as promoting better alignment and collaboration between the team and the rest of the company. There were plenty of thoughts on managing the hype and churn of new, unproven technologies and on focusing on solving the problems of actual customers, be they internal or external.
Best talks:
- “How we do it” category: a clear winner, Guido Iaquinti with a double whammy of talks on scaling Slack’s Ops-related teams and their MySQL clusters.
- “Pointy-haired boss” category: Ben Rockwood with a complete blueprint on managing and growing an SRE team.
- Ignite category: Sharone on staying motivated when contributing to an OSS project that’s not featured on HackerNews.
Talks
Sebastien Goasguen—How can we survive continuous innovation?
Sebastien talked about dealing with the increasing pace of new stuff being created in our industry, driven by the growth of R&D budgets (the four big tech companies have a combined R&D budget greater than that of most countries) and the proliferation of open source from both volunteers and corporations.
It is not realistic to expect the workforce to keep up with what’s happening with the same rigor and depth we were used to, so how do we find, choose, and adopt new software? One way is to follow viral adoption, via HackerNews etc. Sebastien asked who uses Envoy; zero hands went up in the audience. “In Silicon Valley, everyone uses it now”, but he thinks it is clearly not mature enough, even judging by the project’s documentation. This is “zero-day software”: it just came out, so it’s time to put it straight into production. This might sound crazy, but this is the future.
The key is starting from customers’ actual problems, not from new technology; as DevOps says, talk to the customer. The same message is being spread by Kris Buytaert: Return of the dull stack engineer and Moby is killing your DevOps effort. No single piece of software is a silver bullet. We should have a DevOps loop for new tools as well: adapting to new technology, not necessarily adopting it.
Some people measure popularity with Github stars. Github stars don’t mean users, and they don’t mean adoption; a project gets a ton of stars when a HackerNews thread goes crazy. (However, Sharone later said stars are a motivator for open-source project developers.)
Rami Goldratt—Coming up with the next Business Innovation Breakthrough
Rami spoke about strategic planning when a company is experiencing pressure from competitors. In business language, these tools come from the constraint management toolbox (a demand-side constraint: business growth is limited by insufficient demand, an inability to provide compelling value to customers).
As initial growth slows and a field proves viable and attracts competitors, a company matures and enters a “red ocean” (colored with the blood of competitors). This puts pressure on sales and discourages investing much into innovation. However, innovation is necessary, as the easy solutions (lowering the price, reducing quality to maintain profit) often result in a death spiral.
There is an optimal number of innovation projects at any given time. Too much innovation leads to incremental improvements because of resource pressure (no single team has the resources to make major changes). Too little innovation leads to the same outcome because with only one or two ideas, companies usually won’t bet everything on risky changes, only small variations on the existing product’s theme.
A few things to try:
- Look for value from the customers’ point of view, ignoring what might be easy or optimal to do from the company perspective.
- Exceptional, lasting value can be obtained by removing a significant limitation for customers in a way that was not possible before and is not accessible to competitors.
- Find customers who spend a lot of money to satisfy a specific need - this is a visible “tail” that might point at a “dog”: a large number of customers with the same need who are not willing to spend the same amount of money. This is the “tail wagging the dog”.
Abby Fuller—Creating effective container images
“There are many talks on building containers, and each one says all the other talks are wrong”
Abby gave a bunch of tips and strategies for building better containers. App developers might not care whether an image is too big or has less-than-optimal layering, but they should care in a DevOps world, especially if they get paged for related issues (for example, disk space running out because of huge images). A number of tips were focused on minimizing the number of layers in an image by batching steps together (a layer-size audit sketch follows the list), such as:
- switching USER only at the end, for actually running the app; each user switch adds a layer.
- not ADDing large files by themselves; instead, downloading, processing, and cleaning up in one step.
- building Golang images from scratch, since they need no dependencies: “FROM scratch; COPY ./binary /binary; ENTRYPOINT ["/binary"]”.
- avoiding unnecessary recommended packages and cleaning up package managers’ caches, with “--no-install-recommends” and “&& rm -rf /var/lib/apt/lists/*”.
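Not from the talk, but a handy way to see which Dockerfile step carries the most weight is to read layer sizes back from “docker history”; a minimal sketch, assuming a local Docker CLI and a placeholder image name:

```python
import subprocess

def layer_sizes(image):
    """Return (size, creating instruction) pairs for each layer of an
    image, as reported by `docker history` on the local daemon."""
    out = subprocess.check_output(
        ["docker", "history", "--no-trunc",
         "--format", "{{.Size}}\t{{.CreatedBy}}", image],
        text=True,
    )
    return [line.split("\t", 1) for line in out.splitlines() if line]

# "myapp:latest" is hypothetical; each row maps back to a Dockerfile step.
for size, created_by in layer_sizes("myapp:latest"):
    print(f"{size:>10}  {created_by[:100]}")
```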
Docker caching is very useful and speeds things up both locally and on the network, but sometimes it needs to be broken intentionally (when a statement produces different results based on things not in the Dockerfile). “CREAM: cache rules everything around me”. Two examples of using caches well were:
- Caching dependencies for Node JS: “COPY package.json .” before the rest of the source, then “RUN npm install”, so the install layer stays cached until package.json changes.
- Using a single shared base image (per technology stack, perhaps) in production.
Abby quickly touched on a bunch of related topics: multi-stage builds, container security scan tools, garbage collection (spotify-gc), and continued work to reduce the size of community base images.
There were two specific things I heard that could use a more nuanced interpretation: that more layers mean a large image (a layer does add overhead, but it is possible to build slim multi-layer images and huge single-layer images; ideal layers are logical and optimize caching, rather than falling out of specific Dockerfile steps), and the suggestion to switch to minimal base images like Alpine (the process may need to change quite a bit to use a minimal image, especially if installing additional stuff, anything that needs to be compiled, etc. - many minimal images make these steps harder).
Rebecca Fitzhugh—Design Patterns for Efficient DevOps Processes
Rebecca talked about optimizing release engineering processes. The optimization should cover both technical and human aspects: a continuous delivery pipeline for software and a continuous improvement process for humans. With DevOps, the release process can also automate change management by capturing change records from builds and tests.
Rebecca found it more useful to walk backwards from the finish (shipping/deployment) to the start (feature ideation). She does not rely on standard times for tasks or on information not personally obtained, since they are not reliable. The focus is on bottlenecks, such as functional tests, and on doing some steps only when necessary instead of for every change. Complex processes receive extra attention, since complexity is the enemy of reliability. One way to make decisions is to apply the Cynefin framework to release engineering.
I found it peculiar that Rebecca’s process requires personally collecting all information to obtain good results; standard times and hearsay may be less accurate, but they can provide the first iteration of the process. I also found it peculiar that she called the release engineering field relatively new: I’ve personally worked with dedicated release engineers for 15+ years, and change management itself must date back to industrialization days.
Guido Iaquinti—Distributed Teams: Scaling Operations Around the World
Guido’s talk was about the evolution of infrastructure teams at Slack. They started with a single team, including the CTO, dealing with all issues, but they wanted more people and had too many systems for one head. So Slack switched to multiple ops teams with partitioned responsibility, staying under 5 people per team, all in a single timezone. Then even more people were needed, and Slack wanted to introduce follow-the-sun on-call, so teams became geographically distributed.
The current teams are:
- AppOps (dns, net, frontend, websockets),
- StorageOps (db, cache, search, queue),
- VisibilityOps (monitoring, logging, metrics, data warehouse),
- Build & Release.
At the beginning, on-call was shared between some teams due to small headcount; it then became team-specific, with cross-team escalation only for big outages.
The growth brought several challenges: syncing across timezones, knowledge transfer, and culture uniformity. Slack has several techniques to mitigate these, including keeping a log of notable infrastructure changes and outages (“Today in On-Call”), maintaining all runbooks in Github, and plenty of ways to engage people and keep them in the loop:
- Partnering with engineers on projects
- Architecture reviews
- Peer feedback as part of bi-annual performance review
- HQ visits
- Weekly 1:1s with managers
- Cross-team 1:1s with donuts
- Global recognitions, similar to our Reflektive
Some remote offices were started with all-new team members. That was not a good experience, so now Slack sends an existing team member to bootstrap new offices (for 1.5 years in the case of Australia). Guido considers 3 people in an office not enough for sustainable on-call.
Liz Rice—Where’s my code? Securing your code when you don’t even know where it’s running
Liz talked about pipeline processes for building “cattle” servers via immutable images, and about new security features in the containerized world. For containers, the patching process is different: new images are created instead of patching existing ones. Security policies and scanners can be applied at different steps of the container build pipeline. Not having a shell in a container, as in minimal images like Alpine, hinders attackers.
At runtime, containers allow segmenting and encrypting network connections per task. Tools exist to identify and prevent suspicious behavior, like manual execution of commands. Servers that were attacked should be recycled; recycling in general / on a schedule is also a good practice, because it routinely exercises and tests the provisioning process.
Liz did a poll on orchestrator use: about 6 of 200 use Kubernetes; many more use other orchestrators (Nomad, OpenShift, and Swarm were explicitly mentioned).
Q: Your opinion on free security tools like Docker Cloud’s scanner?
A: Free tools are much better than nothing; but specialized tools like Aqua do better, for example on false positives.
My notes:
- Could it be valuable to be able to quickly turn off “preventing suspicious behavior” in case of an outage, where all kinds of “suspicious” manual actions might be attempted?
- Attackers already have various shell-like payloads sent as part of exploits; will it really stop them for long to not have a shell in a container? Or will it merely change the usual hacking process to set up a shell payload every time?
- Recycling attacked servers really only makes sense once the vulnerability has been identified and closed. Especially if it’s not a probabilistic issue that takes a long time to exploit, attackers might be able to compromise new images with the same issue faster than the infrastructure can cycle them (while keeping the service available).
Corey Quinn—Inventing Yourself: The Musings of an Assistant Regional Thought Manager
“Dressing up builds credibility, but so does not fucking up”
The overall theme of the talk appeared to be context. Different companies have different needs, and blindly copying their solutions will not necessarily achieve a good result. Watch How Netflix thinks about DevOps - does the Netflix way of free production access, trading uptime for velocity of innovation, and only hiring senior, experienced engineers work for your organization?
Cargo cult is a hope to get results based on incomplete understanding. “The islanders weren’t stupid; they missed the larger context”.
On audience responsibilities at conference talks: ask good questions. Things that are not a good question:
- calling bullshit (that is neatly handled with Shmoo-balls at ShmooCon)
- telling an irrelevant story
- reciting the resume of the person asking the question
Q: How do we tell people both sides of the story, and not just the positive outcomes?
A: Share some of the negative experiences. The flip side of Chaos Monkey: you’re testing on your paying customers. That’s fine for Netflix, but for banks, is this something to be experimenting with?
Q: on dressing up
A: It started for fun, for a week. My coworkers made fun of it like you wouldn’t believe. But they also started taking me more seriously.
Ben Rockwood—Leading DevOps Teams: Hire, Grow, and Manage
Ben is a Solaris expert who became less passionate about technology, since tech changes rapidly but people don’t change. The same team Ben built still operates things today, but on an entirely different technology stack. Some history of operations management is in Ben’s talk “The DevOps Transformation” from LISA ’11.
Three points of view for managing a team: a systems point of view, management tools, and the employee lifecycle (average employee tenure is 2-3 years, so plan the approach in advance - churn is going to happen).
System point of view:
- diversity (of thought) - a team can be physically diverse yet share the same background and the same ideas.
- vision - a common cause. “Aligned vision unlocks passion, energy and innovation”
- strategy - “plans are useless but planning is indispensable”
- management - system lifecycles, “big principle”. In smaller organizations, it is more important to be a leader, unlike big companies where it’s possible to simply accept the big organization’s processes. Reduce friction in processes and with policy.
Tools:
- LEAN
- problem solving: A3 (see Ben’s talk on A3)
- Kanban/Scrum: execution tools
- Cynefin
- ITIL and COBIT for operations
Employee lifecycle:
a model from The Social Workplace, focusing on:
- position & pay
- culture
- learning & challenge
- knowledge transfer
- organization / promotion
Joe Smith—A Culture of Automation
Joe is on the AppOps team at Slack and spoke about the process of launching things into production. “There are many job titles in the field, and all of them are generally about building resilient systems.”
At the planning step, Slack coordinates multiple changes, plans a rollback strategy ahead of time, and gives teams tools and visibility to make improvements. For documentation, it is important to provide context: don’t just describe how systems work, describe why - this can help with future decisions. For communication, there is a dedicated Slack channel for each feature launch, where all related activity is performed and streamed to.
On a large team, it’s not possible to keep track of everything that’s happening at once.
Slack uses Git repositories and Markdown to store runbooks in one global repo, with a standard structure per service (a hypothetical layout follows the list):
- service readme
- common actions: provision/recreate a box/kill; these are going away with wide adoption of autoscaling groups
- other actions
- for each action:
  - about (when would you do it, what would be the result)
  - sequence of steps
- a page for each monitoring alert, in an alerts/ folder
At Slack, runbooks are a list of steps, not a place for design documentation of the service.
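To make that structure concrete, here is a possible layout for one service’s runbook folder (the file names are invented for illustration; the notes above only describe the general shape):

```
myservice/
  README.md               # service readme
  actions/
    provision-box.md      # each page: about (when to do it, expected
    recreate-box.md       #   result), then the sequence of steps
    kill-box.md
  alerts/
    replication-lag.md    # one page per monitoring alert
```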
Approach to automation:
- make new people write the tools
- gradually convert the runbook into scripts/tools: small pieces first, then add more pieces and take care of special cases (see the sketch after this list)
- eventually the tool is run automatically, and the runbook is no longer needed
- libraries like fabric, pychef, and boto3 are used at Slack for those tools
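As an illustration of that progression (my sketch, not Slack’s actual tooling; the tag-based safety check is invented), a first pass might turn a single runbook step - “kill the box and let the autoscaling group replace it” - into a small boto3 tool:

```python
import boto3

def recycle_box(instance_id, region="us-east-1"):
    """Terminate one instance so its autoscaling group replaces it.
    Mirrors a single runbook step; later iterations can chain more
    steps and handle the special cases the runbook lists."""
    ec2 = boto3.client("ec2", region_name=region)
    # Hypothetical safety check carried over from the runbook text:
    # only act on instances explicitly tagged as recyclable.
    reservations = ec2.describe_instances(InstanceIds=[instance_id])
    instance = reservations["Reservations"][0]["Instances"][0]
    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
    if tags.get("recyclable") != "true":
        raise RuntimeError(f"{instance_id} is not tagged as recyclable")
    ec2.terminate_instances(InstanceIds=[instance_id])
```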
Some tools are available at https://github.com/Yasumoto/tools
Ignite Talks
Sharone Zitzman—When Your Open Source Project Stops Being Cool
When working on a project that’s no longer considered the new hotness, one can get depressed by all the people saying the “uncool” tech is dead. Even contributors who are aware of trolls still experience the demotivation. Everyone is using free and public code, but there is sometimes chaos behind the scenes of OSS projects, and good code does not maintain itself.
What makes an OSS project cool? Let’s redefine it: like CERN running 280k cores on OpenStack; maintaining developer engagement for years; being trusted by the biggest names; and having a stable business model that allows paid developers to work on the project.
There are more ways to get involved in OSS besides writing code. Github stars help maintain motivation.
See also:
- Roads and Bridges, a Ford Foundation report on software infrastructure
- Walkaway by Cory Doctorow.
Gil Zellner—From Ops to Dev and Back again
Gil built a deployment system at Gett with basically the same feature set as Cloudify. However, a set of scripts is a very different thing from an actual polished product. Gil thinks devs should have some sysadmin experience, and the other way around: ops people should have some experience developing applications.
Lior Redlus—How our ISP’s firewall cost us a full day of the entire R&D
This quick talk was a good walkthrough of Coralogix’s troubleshooting process. It is indeed quite rare that the ISP is the culprit.
Lidor Gerstel—Container Orchestration with Rancher
There were about 4 people using Rancher in the audience. Rancher can manage containers and hosts, and it just came out with a new major version: 2.0 has a very different UI than 1.0, and it supports high availability in Active-Active mode. Rancher needs to run an agent on each host.
Keren Barzelai—Fake it till you make it
Three tips on faking it: vocabulary is the easiest thing to learn and fake; talk less and listen more; and ask questions.
Open Spaces
Serverless:
- Chalice - a framework for writing Python on Lambda (a minimal example follows this list)
- Golang can be executed on Lambda by running binaries from a Python script.
- Any serverless stuff requires supporting services provided by the cloud vendor, promoting vendor lock-in.
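For reference (my addition, not from the discussion), the canonical Chalice hello-world shows how little code a Lambda-backed HTTP endpoint takes; “chalice deploy” then creates the Lambda function and the API Gateway route:

```python
# app.py - the shape that `chalice new-project` generates
from chalice import Chalice

app = Chalice(app_name="helloworld")

@app.route("/")
def index():
    # Served by API Gateway + Lambda; no servers to manage, but both
    # supporting services are AWS-specific (hence the lock-in point above).
    return {"hello": "world"}
```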
Continuous delivery: nobody loves Jenkins, yet everybody uses it. I gave the only example of a stack that does not use Jenkins.
Embedded vs. centralized Ops team: some teams have both, but “you don’t get more ops people when you spread them around”. Embedded Ops run the risk of becoming just another dev working on product features, or a single point of failure if all ops-related work is directed at them. Avoid that with a focus on teaching, and with support from management (making clear that the embedded op is not just another engineer who can be assigned customer issues).
Meetup
On the second day, after the conference wrapped up, there was a Meetup with pizza and a few more talks. I skipped the last talk since I already saw it at DevOpsDays PDX.
Philipp Krenn—The Se7en deployment sins
Philipp talked about anti-patterns, mostly around Docker and microservices. I liked the opinionated approach of Elastic: they do not provide a “latest” tag for their containers (since stateful services should not run at an automatically updating latest version), and Elasticsearch containers can’t be run as root, for obvious security reasons.
There was a bunch of reading material on microservice architectures. Philipp gave 3 guidelines for when a microservice approach is desirable: too many people; too many dependencies; and a need to scale a small part of a system independently from the rest of the system.
- The 8 fallacies of distributed computing
- Hodges: Notes on distributed systems for young bloods
- Fowler: Microservice premium
Another anti-pattern mentioned was not separating code and configuration. If a new container has to be rolled out to change a single parameter in how the service is run, that is quite inefficient - even if it appears blissfully dependency-free.
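A common remedy (my example, not Philipp’s; the names are made up) is to read tunables from the environment, so the same image can be reconfigured per deployment without a rebuild:

```python
import os

# One image serves every environment; only the injected environment
# changes, e.g. `docker run -e MAX_WORKERS=16 -e DB_HOST=db.prod myservice`.
MAX_WORKERS = int(os.environ.get("MAX_WORKERS", "4"))
DB_HOST = os.environ.get("DB_HOST", "localhost")

print(f"starting {MAX_WORKERS} workers against {DB_HOST}")
```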
Guido Iaquinti—Database systems at Slack: past, present and future
A great, detail-filled talk on Slack’s approach to relational databases.
Today:
- MySQL Percona 5.6 on EC2, multi-AZ for each cluster. Instance storage, no EBS.
- MySQL pair in Master-master, async replication. Availability over consistency.
- Unique IDs are generated by an external service - no autoincrement.
- Correct server to use is determined by the primary key (odd/even split; see the sketch after this list)
- ~96Gbps peak DB traffic
- 100’s of servers, 100’s of TBs
- In-house sharding topology
- Originally a LAMP stack, migrated to HHVM
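A toy reconstruction of the odd/even split (my illustration; the host names and helper are invented): each externally generated ID has a designated home side of the master-master pair, so writes for a given row normally land on one master:

```python
def pick_master(object_id, pair):
    """Route by ID parity: `pair` is (host_for_even_ids, host_for_odd_ids)."""
    return pair[object_id % 2]

# Hypothetical master-master pair for one shard.
shard_pair = ("db42-a.example.com", "db42-b.example.com")
assert pick_master(1000, shard_pair) == "db42-a.example.com"  # even ID
assert pick_master(1001, shard_pair) == "db42-b.example.com"  # odd ID
```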
Availability is good. Horizontal scaling is possible by splitting hot shards. Writes are not delayed by replication. “Online” schema changes are possible, sometimes expedited by pulling all query load to one side of the cluster and ALTERing the other.
Challenges:
- A (Slack) team can’t grow beyond a single server pair.
- Replication thread is the bottleneck; overall hardware utilization is relatively low.
- Read replicas do not improve things in this setup.
- Requires statement-based replication.
- Occasionally need to manually resolve inconsistencies.
Tomorrow:
- Vitess, built on top of MySQL and InnoDB. Vitess has a set of routing/front-end machines (vtgate) backed by service discovery (etcd, consul, zk), and a shim in front of each MySQL instance (vttablet) that does query deduplication and other goodies.
- A variety of sharding strategies is available.
- AWS i3 instances with NVMe storage
- MySQL 5.7 with GTID and semi-sync replication
- Ubuntu 16.04 with 4.4 kernel (this is standard on 16.04)
What went well/badly for this migration?
- More stable, but slower (2ms vs 1.2ms) than existing architecture (more network travel)
- i3 instances are 60% cheaper
- i3 related bugs somewhere between the kernel and AWS hypervisor: 1668129 (Xen ballooning + hotplug memory)
- AZ affinity is good; enabling it dropped latency by 200ms / 25%