Scaleconf 2018 notes

by Steve Hoeksema, posted 07 Aug 2018.

Progressive Web Apps workshop

  • Fairly broad support as of iOS 11.3
  • No support for native push notifications
  • Still tricky to install, particularly on iOS
  • Course material

Kubernetes workshop

  • Worth considering if we want to be independent of the AWS ecosystem
  • Strong support on multiple providers (AWS, GCP, Azure, Catalyst)
  • Need to consider other related services along with compute/block/object, e.g. functions, managed SQL, NoSQL, queues
  • Course material

Scaling the clouds - a design approach

  • stratamap.io
  • Companies have 72 hours to notify affected users of a breach under GDPR. Is your tooling good enough to find affected users and what was exposed?
  • Archimate - an open successor of UML and BPMN
  • Useful for all kinds of mapping at multiple levels of detail - cloud infrastructure, customer journeys
  • Mandated by EU, NZ govt, etc
  • Archi tool

The Blockchain, a beginners guide to all the important bits

  • An introduction to blockchain, bitcoin, ethereum, etc

A token walks into a SPA

  • “Auth0 ambassador”
  • An explanation of JWTs and how they are useful
  • An analogy of JWTs as driver licenses
  • JWT payload does not have to be JSON - could be xml, protobuf
  • JWT algorithm overview
  • JWT handbook
  • How do you rotate private keys?
  • Token expiry should be measured in minutes
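
To make the token structure and the short-expiry advice concrete, here is a minimal sketch using only the Python standard library. It assumes HS256 with a shared secret for brevity, whereas the talk's key-rotation question concerns asymmetric keys. The function names and the 300-second default lifetime are illustrative, not from the talk. Real code should use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode base64url, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, secret: bytes, ttl_seconds: int = 300, now=None) -> str:
    """Build header.payload.signature with an exp claim ttl_seconds from now."""
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    payload = dict(claims, iat=issued, exp=issued + ttl_seconds)
    signing_input = (
        b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(payload).encode())
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)


def verify_token(token: str, secret: bytes, now=None) -> dict:
    """Return the claims if the signature matches and the token has not expired."""
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(signature)):
        raise ValueError("bad signature")
    payload = json.loads(b64url_decode(signing_input.split(".")[1]))
    current = time.time() if now is None else now
    if current >= payload["exp"]:
        raise ValueError("token expired")
    return payload
```

The optional `now` argument exists only so expiry can be exercised without waiting; callers would normally omit it.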

My service runs at 99.999%, tweets about outages are fake: it’s our competition trying to malign us!

  • Detect anomalous changes in traffic volumes
  • Your service may be fine during an outage if the problem is at an ISP or CDN - will you know? Users don’t care where the problem lies
  • Measured metrics can be gamed. Post-load time should be included in performance measurements.
  • Metrics drive behaviour and may have perverse incentives
  • Hanoi rat massacre of 1902
  • Your metrics may miss the point - you might have great uptime but with incorrect results. The fastest page on Bing is an error message
  • Another Bing example: a results page that was entirely ads: great click rate, great revenue, wrong data
  • Another Bing example measuring click rate: if they can answer a question inline on the page (e.g. population of Wellington), no need for a click
  • Metrics should be high quality, regularly reviewed
  • Consider “dialtone availability” - you should always pick up a phone and get a dialtone. You should always be able to get to Bing home page and do a search. Other features can have lower SLAs
  • Combine outside-in (outside your network) metrics with anomaly detection
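
One very simple way to flag an anomalous change in traffic volume is a z-score against recent history. This is a sketch under the assumption that request counts per interval are already being collected. The function name and the threshold of 3 standard deviations are illustrative choices, not anything the speaker prescribed.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Flag current if it lies more than threshold standard deviations from the mean of history."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu  # flat history: any change stands out
    return abs(current - mu) / sigma > threshold
```

Feeding it counts measured from outside your own network covers the case where the ISP or CDN is the thing that broke.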

Lessons learnt building Heptio Contour

  • Contour maps Kubernetes to the Envoy sidecar proxy
  • Ingress controller benefits:
    • traffic consolidation: combining multiple LBs into one
    • TLS management
    • configuration abstraction (over haproxy/nginx/apache etc)
    • path-based routing
  • overview of golang dependency awfulness and its workarounds
  • Keep docker build/push out of dev lifecycle
  • Try and keep integration tests to a minimum
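
Path-based routing, one of the ingress controller benefits listed above, can be sketched as a longest-prefix match. This is an illustration of the idea only, not how Contour or Envoy implement it. The route table and backend names are hypothetical.

```python
def pick_backend(routes, path):
    """Return the backend for the longest route prefix matching path, or None."""
    best = None
    for prefix, backend in routes.items():
        base = prefix.rstrip("/")
        # Match on whole path segments so "/api" does not capture "/apiary".
        matches = path == base or path.startswith(base + "/")
        if matches and (best is None or len(prefix) > len(best[0])):
            best = (prefix, backend)
    return best[1] if best else None


# Hypothetical route table: prefix -> backend service name.
ROUTES = {"/": "web", "/api": "api-svc", "/api/admin": "admin-svc"}
```

With this table, a request for `/api/admin/users` goes to `admin-svc`, while `/apiary` falls through to `web`.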

Architecting for performance in the cloud

High Availability Microservices: Load Balancing in the World of Ingress and Service Mesh

Making test automation observable

  • Applying the five W’s to testing
    • What are we testing?
    • When would we find out about an issue?
    • Where are our tests running?
    • Why did something go wrong?
    • Who should know about these tests?
  • Seemed not to be directly relevant to a CI-on-PR workflow, where the answers to all five are right there?
  • Built up an integration test coverage map by tracking User-Agent on web server logs against API paths
  • Try and reduce noise in your test failures - retries, cleaning up and consolidating errors
  • Apply the same test for your users. Their test coverage was much different from user behaviour
    • What are our users doing?
    • When would we find out if they have a problem?
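
The User-Agent coverage map could look something like the sketch below. It assumes access logs in the common "combined" format and that the test suite sends a recognisable User-Agent. The marker string `integration-tests` and the function names are hypothetical, since the talk's actual tooling was not shown in these notes.

```python
import re
from collections import defaultdict

# Matches the request, status, size, referer and user-agent parts of a combined log line.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def coverage_map(log_lines, test_agent_marker="integration-tests"):
    """Count hits per (method, path), split into test traffic and user traffic."""
    hits = defaultdict(lambda: {"tests": 0, "users": 0})
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        path = match["path"].split("?", 1)[0]  # ignore query strings
        kind = "tests" if test_agent_marker in match["agent"] else "users"
        hits[(match["method"], path)][kind] += 1
    return dict(hits)


def untested_paths(hits):
    """Endpoints that users call but the test suite never touches."""
    return sorted(key for key, count in hits.items() if count["users"] and not count["tests"])
```

Comparing the two columns is what surfaces the gap between test coverage and real user behaviour.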

When two platforms become one: the complexities of merging 6 years of data

  • Focus on customers
  • Small, iterative changes
  • Having two brands travis-ci.{com,org} hurt acquisition, diluted
  • Tested new features on .org; paid customers felt left out because they had to wait
  • Make changes user-facing ASAP

DevOps vs SRE - Get off my lawn

  • Fundamental conflict between agility and stability, traditional dev vs ops/sysadmin
  • Devs typically “won” as they were closest to business needs
  • ca. 2009, resolved by putting dev+ops in the same room, talking
  • ca. 2018, also means breaking down barriers in security, marketing, etc
  • Accept partial failure as normal, necessary for rapid progress
  • Use outages as an opportunity for learning (blameless postmortems)
  • Eliminate toil
  • Measure everything - MTTR, deploy times, deploy frequency, toil
  • Paying attention to your measurements is another story
  • class SRE implements DevOps
  • Reduce cost of failure (blue/green deployments and rollbacks)
  • “Automate this year’s job”
  • Books free until August 23
  • Avoid “superheroes” - might be useful early in a small company, bad as it scales
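
As a small illustration of "measure everything", MTTR can be computed from incident records. The sketch assumes each incident is a (detected, resolved) pair of datetimes. That record shape and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average (resolved - detected) over a list of (detected, resolved) datetimes."""
    if not incidents:
        return timedelta(0)
    total = sum((resolved - detected for detected, resolved in incidents), timedelta(0))
    return total / len(incidents)
```

The same shape works for deploy times if the pairs are (started, finished) instead.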

Hacking Web Performance

  • “CSS as appetizer, JavaScript as dessert” - slows load times
  • Nokia 1, brand new, $140 smartphone that runs Android 9, good example of a low-end phone
  • Try and fit your first load in 14.6KiB
  • Can fit more in that 14K with gzip
  • Can fit even more with zopfli, brotli
  • Embed as much as possible (styles, js, data urls)
  • Speed up HTTPS first load with HSTS
  • Consider HSTS preload too
  • Speed up first byte with QUIC
  • New cache headers: immutable, stale-while-revalidate
  • Prefetch, preconnect, and preload with <link> elements or Link: headers
  • React to bad connections a la Netflix dynamic quality changes
  • Use client hints, save-data, and device-memory
  • Books
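
A quick way to check whether a first response fits the budget after compression is sketched below. It assumes the figure comes from ten 1460-byte TCP segments in the first round trip (14600 bytes), which is the usual explanation but was not spelled out in these notes. It also ignores HTTP header and TLS overhead, so treat the result as optimistic. The names are illustrative.

```python
import gzip

# Assumption: ten TCP segments of 1460 bytes each fit in the first round trip.
FIRST_ROUND_TRIP_BUDGET = 10 * 1460  # 14600 bytes


def fits_first_round_trip(body: bytes, budget: int = FIRST_ROUND_TRIP_BUDGET) -> bool:
    """True if the gzip-compressed body is no larger than the budget."""
    return len(gzip.compress(body, compresslevel=9)) <= budget
```

Repetitive markup compresses far below its raw size, which is why more fits in that first 14K with gzip. zopfli and brotli are not in the standard library, so they are left out here.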

How we automated our test database restores

  • All devs, managers, testers at FMG can restore a test database at any time
  • Had problems with polluted test data, slow restores, etc
  • Used a snapshot tool for SQL Server
  • Built a self-service tool for people to use

Serverless in Practise

  • Used Lambda, Firebase, Algolia
  • API GW -> λ GraphQL -> λ -> DynamoDB -> λ -> Firebase
  • λ cost $700/mo, compared to cloudfront $5100, cloudwatch $3100
  • netlify, cypress, jest
  • DLQ for failed λ
  • Can set λ max concurrency to avoid slamming SQL DB. Can set to 0 as a kill switch
  • Each function:
    • should perform 1 task
    • should perform 0 or 1 data transformations
    • should have no state and be idempotent
    • should have minimal permissions
    • should avoid recursion
  • Use step functions or durable functions to manage state
  • Serverless Code Patterns
  • acloud.guru
  • API Proxy, fanout, inline stream transformation patterns
  • Serverless Architectures on AWS
  • Serverless Conf NY ‘16 [I stumbled across this by accident]
  • How do you avoid vendor lock-in? Not so much a problem for code, much more for surrounding ecosystem: SQS, SNS, DynamoDB, step functions
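
The "no state and idempotent" rule for functions can be sketched as below. The in-memory set and dict stand in for external storage, which in a real deployment might be a table with a conditional write. The event shape (`id`, `key`, `value`) and the function names are hypothetical.

```python
def make_handler(processed_ids, store):
    """Return a handler that applies each event at most once, keyed by event id.

    processed_ids and store are stand-ins for external storage; the handler
    itself keeps no state between invocations.
    """
    def handler(event):
        event_id = event["id"]
        if event_id in processed_ids:
            return "skipped"  # a retry or duplicate delivery: do nothing
        store[event["key"]] = event["value"]
        processed_ids.add(event_id)
        return "applied"

    return handler
```

Because retries and duplicate deliveries are harmless, a failed invocation can be replayed safely from a DLQ.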

Move legacy apps to Windows Containers to take advantage of modern infrastructure and orchestration

  • 3 platform options: windows server core, nano server, .net core
  • No GUI
  • [some tips to do silent installs]
  • Install prerequisites first to avoid multi-step installs
  • ServiceMonitor.exe
  • [example Dockerfile]
  • cmd.exe /c start analogous to macOS open or Linux xdg-open
  • Can’t join activedirectory domains inside containers yet
  • Can almost run Linux containers in Windows

Data patterns catalogue: simplifying the journey from monolith to modern data architectures

  • Monoliths have been working for decades
  • Enterprise data warehouses have also accumulated data monoliths
  • Focus on asynchronous events, immutable log streaming, and microservices
  • Strangler pattern
  • Collection of info available “in a few weeks”
  • Event Gateway: identify event seams, publish events when a monolith changes state
  • Batch to event adapter
  • Event to batch adapter
  • Change data capture
  • Data Catalogue and Schema Registry with avro, parquet, protobuf, thrift
  • Map event types to topics + streams: can validate events are going where they are supposed to, support introspection
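
Mapping event types to topics makes the "are events going where they are supposed to" check almost trivial. The sketch below assumes each event carries its type and the topic it was published to. The mapping contents, field names, and function name are hypothetical.

```python
# Hypothetical registry: event type -> the topic it is expected on.
TOPIC_FOR_EVENT_TYPE = {
    "customer.created": "customers",
    "order.placed": "orders",
}


def misrouted(events):
    """Return events whose topic differs from the one registered for their type.

    Events with an unregistered type are reported too, since nothing says
    where they belong.
    """
    return [
        event
        for event in events
        if TOPIC_FOR_EVENT_TYPE.get(event["type"]) != event["topic"]
    ]
```

The same registry doubles as an introspection aid: listing it shows which event types exist and where to subscribe for them.
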
Also available as an Atom feed, or on Medium.
This post licensed as CC BY-NC-SA.