October 31, 2019

DevOps Days Toronto 2019 conference notes

For me, the local DevOps Days is a must-go event – the tickets are extremely affordable and it is a great way to connect with the local community beyond the much smaller circle of people who attend monthly DevOps events. As with all DevOps Days, tools and technologies are deemphasized; people and processes take precedence (in that order). This year’s venue, Evergreen Brickworks, is a great conference space with plenty of room and lots of fresh air, and I wish more Toronto events took advantage of it.

Drafting Success—Emily Freeman

Emily’s talk reflected on her experience writing DevOps for Dummies, and not just the Instagram highlight reel of success – the talk addressed frustrations, slow progress, writer’s block, and more. Emily drew numerous parallels between the writing process and engineering work: both are creative endeavors, and that was the connection between writing and DevOps.

Publishers want to see the table of contents early on; it is the “design document” for a book. But plans will change, and at some point while the book is being written the original table of contents might no longer be relevant. This might upset managers, but it is the way to get to the destination, via continuous improvement.

A certain amount of procrastination is part of the creative process and should not be suppressed. During that time, the brain will continue chipping away at the task.

Making Connections Visible—Dominica DeGrandis

Dominica’s talk, followed by an open space in the afternoon, was about understanding and visualizing the overhead of coordination across multiple teams and the reasons why it takes so long to get things done. These notes also include my takeaways from the open space.

Among various development/delivery tasks, communicating across teams is perhaps the hardest one to optimize. As complexity grows, teams and people specialize, and coordination costs go up. An agile approach has to be applied everywhere to make an impact: a developer team adopting agile is not making a major contribution to the entire value stream when all the processes around it (triage, financing, releases) are on monthly, quarterly, or yearly cadences.

When doing a flow/value stream exercise, have an executive in the room. Their jaw might drop when seeing how much communication is really required to get something done. Executive buy-in is a challenge in any transformation; bring an executive to the stage/event – CFO, legal… “when there’s a need to push back or to get budget, you’ll know if you have buy in.”

Value stream exercises are not the same as value stream mapping (the latter is less frequent and a different process) and are distinct from management (line management is focused on immediate, tactical decisions). All interdependent work belongs to a single value stream, and the bigger the organization, the more complex the value stream will be.

How to get started with value streams? With a humble approach to discovery: “We know we haven’t been serving you well, and we want to serve you better.” Some common complaints: things take too long, requests made to the team disappear, deliverables are not predictable. Measure how long things are taking, starting from the point where something is approved as a project/initiative to work on. The clock should stop not when something is shipped, but when value is actually delivered to the customer. The ratio between work time and wait time (“flow efficiency”) matters a lot. When wait time dwarfs work time, speeding up the work barely changes how long delivery takes.
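
The measurement above can be sketched in a few lines of Python. This is only an illustration: the WorkItem shape and the sample dates are made up, and flow efficiency is computed using the common definition (active work time divided by total lead time), which may not be the exact formula used in the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WorkItem:
    """One request moving through the value stream (hypothetical shape)."""
    approved_at: datetime          # clock starts: approved as a project/initiative
    value_delivered_at: datetime   # clock stops: the customer actually gets the value
    work_time: timedelta           # time someone was actively working on it

    @property
    def lead_time(self) -> timedelta:
        return self.value_delivered_at - self.approved_at

    @property
    def wait_time(self) -> timedelta:
        return self.lead_time - self.work_time

    @property
    def flow_efficiency(self) -> float:
        """Share of the lead time spent on active work (0.0 to 1.0)."""
        if self.lead_time <= timedelta(0):
            raise ValueError("value must be delivered after approval")
        return self.work_time / self.lead_time


# Made-up sample: approved September 2, value delivered October 2, 3 days of work.
item = WorkItem(
    approved_at=datetime(2019, 9, 2),
    value_delivered_at=datetime(2019, 10, 2),
    work_time=timedelta(days=3),
)
print(f"lead time: {item.lead_time.days} days")
print(f"wait time: {item.wait_time.days} days")
print(f"flow efficiency: {item.flow_efficiency:.0%}")
```

In this made-up sample, halving the work time from 3 days to 1.5 days would only shorten the lead time from 30 days to 28.5, which is why the waiting is the part worth attacking first.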

Observation is another approach: if you have access to a team’s communication, like their Slack, you can see what their pain points are. “Fiary” - the running diary of fires encountered by the team. “Don’t ask people what they do, go and observe what they are doing.”

When optimizing for a single metric, the team will realign to optimize that metric to the possible detriment of other metrics and desirable outcomes. A balanced set of metrics will show what the relationships between the metrics actually are.

Double, Double, Toil and Trouble—Bozhidar Lenchov and Jesse Malone

This talk was about identifying and reducing toil. So what is toil?

  • for alerts, toil is an alert with a known path to resolution. A workflow that is prone to operator error. Repetitive actions.
  • any organizational overhead: meetings, HR paperwork…
  • “manual work and common concerns brought up by the team that add no lasting value”

Metrics for detecting toil: interrupts, pages, classification and tracking of unscheduled/reactive work. New team members’ perspective can be invaluable as existing team members could be somewhat “desensitized” to toil.

A simple team exercise to discuss toil is a “toil snake” - the vertical axis is the type/kind of toil activity, and the horizontal axis records who did it, when, and how long it took. The longest “snakes” are the most common categories of toil. Those categories can then be sorted by best value vs. investment and frequency (relevant XKCD).
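
For teams that would rather tally the toil snake from a log than from sticky notes, here is a rough Python sketch. The ToilEntry fields mirror the axes described above; the names, categories, and numbers are invented sample data, and ranking by entry count (with total minutes as a tie-breaker) is my assumption about what “longest” means.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToilEntry:
    category: str   # vertical axis: type/kind of toil
    who: str        # horizontal axis: who did it
    when: str       # ISO date, kept as plain text for simplicity
    minutes: int    # how long it took


def build_snakes(entries):
    """Group entries by category; each list is one 'snake'."""
    snakes = defaultdict(list)
    for entry in entries:
        snakes[entry.category].append(entry)
    return snakes


def rank_snakes(snakes):
    """Return (category, entry count, total minutes), longest snake first."""
    rows = [
        (category, len(items), sum(e.minutes for e in items))
        for category, items in snakes.items()
    ]
    return sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)


# Invented sample data for illustration only.
entries = [
    ToilEntry("cert renewal", "ana", "2019-10-01", 45),
    ToilEntry("disk full page", "raj", "2019-10-02", 20),
    ToilEntry("disk full page", "ana", "2019-10-07", 15),
    ToilEntry("manual deploy", "lee", "2019-10-08", 60),
    ToilEntry("disk full page", "lee", "2019-10-11", 25),
    ToilEntry("manual deploy", "raj", "2019-10-15", 50),
]

ranked = rank_snakes(build_snakes(entries))
for category, count, minutes in ranked:
    print(f"{category:<16} {'#' * count:<4} {minutes} min")
```

Note that the longest snake is not necessarily the most expensive one: in this sample the disk pages happen most often, but the manual deploys consume more total time, which is where the value vs. investment sorting comes in.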

Inner Source—Wasim Hossain

Inner sourcing is the practice of adopting successful open source practices into the enterprise.

Three key practices:

  • open collaboration: allows input and work by people passionate about the topic, regardless of organizational affiliation.
  • modular and testable architecture: lowers the barriers of knowledge for first time contributors, reduces the risk of changes breaking things and regressing.
  • strong core team.

Questions to ask for inner sourcing to be successful:

  1. is our culture open and transparent?
  2. are we well resourced / do we have hard deadlines?
  3. do we contribute back to the open source community?
  4. are we running a single shared platform or a variety of platforms? (Max: a single platform makes it easier, but I wouldn’t say not having a single platform is a major blocker to delivering some value)

Just In Time Cloud Infrastructure—Austen Novis

Capital One, as a large organization, ran into three issues with a centralized infrastructure-as-code approach. The first was that infrastructure teams and application teams were separated and did not have good communication channels, resulting in problems when shared infrastructure was modified. The second, as far as I could understand it, was that using a stateful system like Terraform did not work well in an environment where users interact directly with the infrastructure and stored state diverges from actual configuration. The third issue was sequencing of operations, something Terraform struggles with in certain scenarios.

As a result, Particle Cloud was born, as an implementation of “JITI” - just in time infrastructure. It does not have persistent state and does not attempt to reconcile existing infrastructure, so manual changes are wiped by the destroy/recreate deployment cycle. It can sequence steps explicitly. And it promotes a shared-nothing approach by packaging the required infrastructure code into the service’s repository and reducing dependencies. The destroy/recreate cycle also enables cost savings: the QA environment is destroyed during non-business hours.

Max: I’m not so sure I bought into the concept. Certainly, there are some good ideas: versioning infrastructure code with service code helps in understanding changes, and requires simpler tooling than attempting to apply changes in large IaC monorepos. However, the leading use case seems fairly simple - disposable environments for data science models - and the idea of “tearing down the world” every so often might not work as well for more persistent / higher availability services. Unless working in a shared-nothing world, this model also brings security challenges around insufficiently flexible IAM permissions, and does not entirely solve the case of a centralized infra team wanting to change some bits in a standard way across all services.

Shortcuts and Scenic Routes—Matthew Chum and Monti Ghai

This talk explored the potential perils of three possible “shortcuts”, hot takes on implementing DevOps practices:

  • “automate everything” - can result in deeper silos, if automation work is offloaded to an automation team and the service owners never touch automation. Maintaining automation code can be a major time sink, especially if the surrounding environment and tooling are changing rapidly. If automation is not continuous, configuration drift can result. https://xkcd.com/1319/

  • “you build it you run it” - can run into significant resistance: taking time away from feature work, frustration of environment differences when hands-on work is standard practice. A risk of continuing the culture of blame by piling additional operational workload on developers. The suggested solution is expert teams, infrastructure groups that help with shared tooling and allow “supported responsibility”.

  • “one size fits all” - needs no further discussion :)

My Favourite Errors—Hany Fahim

Hany’s talk explained the possible causes of some common HTTP and networking errors. I found more value in the classification of errors into “bad” (does not indicate the source of the problem; says the problem is somewhere on the way) and “good” (localized - the component with the error is known, and specific - the real reason or root cause for the problem can be easily found). Hany’s least favorite error is no error at all: the site is down, but everything is green.

Among the lessons learned: monitoring checks need timeouts (a site check that is hung / not responding is an actionable event), and you need to know about all maintenance events, even “non service impacting” ones, at upstream and service providers.
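
To make the timeout lesson concrete, here is a minimal Python sketch of a site check that reports a hung check as its own outcome instead of waiting forever. The check_site function, its three result strings, and the opener parameter (there only so the check can be exercised without a network) are assumptions of this sketch, not something shown in the talk.

```python
import socket
import urllib.error
import urllib.request


def check_site(url, timeout_seconds=5.0, opener=urllib.request.urlopen):
    """Return 'ok', 'error', or 'timeout' for a single HTTP check."""
    try:
        with opener(url, timeout=timeout_seconds):
            return "ok"
    except urllib.error.HTTPError:
        # The server answered, but with a 4xx/5xx status.
        return "error"
    except (TimeoutError, socket.timeout):
        # The check hung while reading: an actionable event of its own.
        return "timeout"
    except urllib.error.URLError as exc:
        # Connection-phase timeouts arrive wrapped in URLError.
        if isinstance(exc.reason, (TimeoutError, socket.timeout)):
            return "timeout"
        return "error"
    except OSError:
        return "error"


def hung_opener(url, timeout):
    """Stand-in for a site that never answers (demonstration only)."""
    raise socket.timeout("timed out")


print(check_site("https://example.com/health", opener=hung_opener))
```

Keeping “timeout” separate from “error” lets the alerting side treat a hung check as a signal in its own right rather than as a missing data point.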

Short Notes from Talks and Open Spaces

Communicating and measuring DevOps success: invite other teams who are interested in DevOps practices but don’t know where to start to your team’s demos, and showcase technology and processes, not just results. Focus on different areas of technology at different times to keep it interesting and to showcase different facets of the practice.

Sensory Friendly Monitoring: reflect on every alert received. Was it necessary? Can the resolution be automated? Was the way the alert was handled a permanent fix or was it a band aid? Was the situation urgent enough to warrant an alert?

Kubernetes - don’t believe the hype by Renata Rocha, an Ignite talk that drew thunderous applause: it is an extremely complex system, requiring time and cost to learn. Case studies and conference talks will help you learn the boundaries for a successful implementation. Tight schedules and small infrastructures are huge warning signs that Kubernetes might not be appropriate. What works for Google might be a poor fit for you. Related lessons from Jacquelyne Grindrod’s talk: the team didn’t understand how things worked, so they had a lot of issues troubleshooting Kubernetes. They used Helm, but the writers of Helm charts might have had different intentions and goals than what the team wanted.

What should incident commanders know about your services? What is it? Why does it matter? Who owns it? Who is on call? Where is its runbook? What is its Slack channel? A good, non-pun-based service name helps a lot.
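
One way to keep those answers handy is a small record per service, plus a check for the questions nobody has answered yet. The Python sketch below is hypothetical: the ServiceRecord field names and the sample service are my own and do not come from any particular service catalog tool.

```python
from dataclasses import dataclass, fields


@dataclass
class ServiceRecord:
    """Answers an incident commander needs (hypothetical field names)."""
    name: str
    what_it_is: str = ""
    why_it_matters: str = ""
    owner: str = ""
    on_call: str = ""
    runbook: str = ""
    slack_channel: str = ""


def missing_answers(record):
    """List the fields that are still blank for this service."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Made-up example service with a descriptive, non-pun-based name.
record = ServiceRecord(
    name="payments-api",
    what_it_is="Handles card payments",
    owner="payments team",
    slack_channel="#payments",
)
print(missing_answers(record))
```

Running a check like this across all services before an incident happens is cheaper than discovering the blanks during one.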

Journey to the promised land: everyone and every single team will have different ideas on what the promised land is. Lack of communication/requirements resulted in building the wrong thing and wasting time. For coordinating and reducing interruption, the process that worked the best was a meeting of all teams to hear their expectations for the coming sprint. Using forms to receive help requests did not work: it felt like raising walls. Creating “office hours” didn’t work either: other teams felt that they did not want to interrupt us.

Slackbots and any other webhook-based development: ngrok is a great way to make something developed on your laptop instantly accessible to any platform making API calls and webhooks.
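
As a rough illustration of that workflow, here is a minimal local webhook receiver using only the Python standard library. The payload shape (a JSON object with a “type” field), the port number, and the function names are assumptions for this sketch; a real platform such as Slack defines its own payloads and verification steps.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_event(raw_body):
    """Turn a webhook body into (status_code, response_bytes)."""
    try:
        event = json.loads(raw_body)
    except ValueError:
        return 400, b'{"error": "invalid JSON"}'
    if not isinstance(event, dict):
        return 400, b'{"error": "expected a JSON object"}'
    # "type" is a hypothetical field; real platforms define their own schema.
    reply = {"received": event.get("type", "unknown")}
    return 200, json.dumps(reply).encode("utf-8")


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_event(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Listen on localhost only; a tunnel can expose this port publicly."""
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()
```

Calling serve() and then running `ngrok http 8000` in another terminal should give a public URL that forwards to the laptop; that URL is what gets pasted into the platform’s webhook settings.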