Scaleconf 2018 notes
by Steve Hoeksema, posted 07 Aug 2018.
Progressive Web Apps workshop
- Fairly broad support as of iOS 11.3
- No support for native push notifications
- Still tricky to install, particularly on iOS
- Course material
Kubernetes workshop
- Worth considering if we want to be independent of the AWS ecosystem
- Strong support on multiple providers (AWS, GCP, Azure, Catalyst)
- Need to consider other related services along with compute/block/object, e.g. functions, managed SQL, NoSQL, queues
- Course material
Scaling the clouds - a design approach
- stratamap.io
- Companies have 72 hours to notify affected users of a breach under GDPR. Is your tooling good enough to find affected users and what was exposed?
- Archimate - an open successor of UML and BPMN
- Useful for all kinds of mapping at multiple levels of detail - cloud infrastructure, customer journeys
- Mandated by EU, NZ govt, etc
- Archi tool
The Blockchain, a beginners guide to all the important bits
- An introduction to blockchain, bitcoin, ethereum, etc
A token walks into a SPA
- “Auth0 ambassador”
- An explanation of JWTs and how they are useful
- An analogy of JWTs as driver licenses
- JWT payload does not have to be JSON - could be xml, protobuf
- JWT algorithm overview
- JWT handbook
- How do you rotate private keys?
- Token expiry should be measured in minutes
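To make the token structure and the short expiry concrete, here is a minimal sketch using only the Python standard library. It is not from the talk: it assumes HS256 (a shared secret) rather than the private-key algorithms the talk's rotation question implies, and the names `issue_token`, `verify_token`, and the five-minute default are illustrative. A maintained JWT library is the better choice in production.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Restore the stripped padding, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(subject: str, secret: bytes, ttl_seconds: int = 300,
                now: Optional[float] = None) -> str:
    """Build header.payload.signature with an expiry measured in minutes."""
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": subject, "iat": issued, "exp": issued + ttl_seconds}
    signing_input = (b64url(json.dumps(header).encode()) + "."
                     + b64url(json.dumps(claims).encode()))
    return signing_input + "." + b64url(_sign(signing_input, secret))


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Check the algorithm, the signature, and the expiry; return the claims."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    if json.loads(b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = _sign(header_b64 + "." + payload_b64, secret)
    if not hmac.compare_digest(expected, b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    claims = json.loads(b64url_decode(payload_b64))
    if (time.time() if now is None else now) >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

The verifier pins the algorithm instead of trusting whatever the header says, and compares signatures in constant time.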
My service runs at 99.999%, tweets about outages are fake: it’s our competition trying to malign us!
- Detect anomalous changes in traffic volumes
- Your service may be fine during an outage if the problem is at an ISP or CDN - will you know? Users don’t care where the problem lies
- Measured metrics can be gamed. Post-load time should be included in performance measurements.
- Metrics drive behaviour and may have perverse incentives
- Hanoi rat massacre of 1902
- Your metrics may miss the point - you might have great uptime but with incorrect results. The fastest page on Bing is an error message
- Another Bing example: a results page that was entirely ads: great click rate, great revenue, wrong data
- Another Bing example measuring click rate: if they can answer a question inline on the page (e.g. population of Wellington), no need for a click
- Metrics should be high quality, regularly reviewed
- Consider “dialtone availability” - you should always pick up a phone and get a dialtone. You should always be able to get to Bing home page and do a search. Other features can have lower SLAs
- Combine outside-in (outside your network) metrics with anomaly detection
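As a rough illustration of detecting anomalous changes in traffic volume, the sketch below flags a request count that sits too many standard deviations away from recent history. The z-score approach, the function name, and the threshold of 3 are assumptions for illustration; the talk did not specify a method.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], current: float,
                 threshold: float = 3.0) -> bool:
    """Return True when `current` deviates from `history` by more than
    `threshold` sample standard deviations."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu  # flat history: any change is notable
    return abs(current - mu) / sigma > threshold
```

Feeding it counts measured from outside your own network (for example from probes at several ISPs) is what makes it catch the ISP or CDN outages mentioned above.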
Lessons learnt building Heptio Contour
- Contour is an ingress controller that maps Kubernetes to the Envoy sidecar proxy
- Ingress controller benefits:
  - traffic consolidation: combining multiple LBs into one
  - TLS management
  - configuration abstraction (over haproxy/nginx/apache etc)
  - path-based routing
- overview of golang dependency awfulness and their workarounds
- Keep docker build/push out of dev lifecycle
- Try and keep integration tests to a minimum
Architecting for performance in the cloud
- Uh, don’t do table scans?
- Consider whole stack - code, network, infra, DB
- track mean time to detection and mean time to recovery
- Even Amazon doesn’t always get it right
- Have a recovery plan
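Tracking mean time to detection and mean time to recovery only needs three timestamps per incident. A minimal sketch follows; the `Incident` record is hypothetical, and it assumes both means are measured from the moment the incident started (some teams measure recovery from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass
class Incident:
    started: datetime    # when the problem began
    detected: datetime   # when monitoring or a human noticed
    recovered: datetime  # when service was restored


def mean_time_to_detection(incidents: Sequence[Incident]) -> timedelta:
    total = sum((i.detected - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_recovery(incidents: Sequence[Incident]) -> timedelta:
    total = sum((i.recovered - i.started for i in incidents), timedelta())
    return total / len(incidents)
```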
High Availability Microservices: Load Balancing in the World of Ingress and Service Mesh
- A promo for Nginx Plus - can consolidate some functions
- Cloud Native Landscape
- Istio
Making test automation observable
- Applying the five W’s to testing
  - What are we testing?
  - When would we find out about an issue?
  - Where are our tests running?
  - Why did something go wrong?
  - Who should know about these tests?
- Seemed not to be directly relevant to a CI-on-PR workflow, where the answers to all five are right there?
- Built up an integration test coverage map by tracking User-Agent on web server logs against API paths
- Try and reduce noise in your test failures - retries, cleaning up and consolidating errors
- Apply the same test for your users. Their test coverage was much different from user behaviour
  - What are our users doing?
  - When would we find out if they have a problem?
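The coverage map built from User-Agent headers could look roughly like the sketch below. It assumes access logs in the common "combined" format and a test suite that identifies itself with a User-Agent starting with `integration-tests`; both are assumptions, as the talk's actual log format and agent string were not recorded here.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Matches: "METHOD /path HTTP/x" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

Endpoint = Tuple[str, str]


def coverage_map(lines: Iterable[str],
                 test_agent_prefix: str = "integration-tests"
                 ) -> Tuple[Counter, Counter]:
    """Count hits per (method, path), split into test traffic and user traffic."""
    tested: Counter = Counter()
    real: Counter = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        endpoint = (match["method"], match["path"].split("?")[0])
        if match["agent"].startswith(test_agent_prefix):
            tested[endpoint] += 1
        else:
            real[endpoint] += 1
    return tested, real


def untested_endpoints(tested: Counter, real: Counter) -> List[Endpoint]:
    """Endpoints users hit that the integration tests never touched."""
    return sorted(endpoint for endpoint in real if endpoint not in tested)
```

Comparing the two counters is what surfaces the gap between test coverage and real user behaviour.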
When two platforms become one: the complexities of merging 6 years of data
- Focus on customers
- Small, iterative changes
- Having two brands, travis-ci.{com,org}, hurt acquisition and diluted the brand
- Tested new features on .org; paid customers felt left out because they had to wait
- Make changes user-facing ASAP
DevOps vs SRE - Get off my lawn
- Fundamental conflict between agility and stability, traditional dev vs ops/sysadmin
- Devs typically “won” as they were closest to business needs
- ca. 2009, resolved by putting dev+ops in the same room, talking
- ca. 2018, also means breaking down barriers in security, marketing, etc
- Accept partial failure as normal, necessary for rapid progress
- Use outages as an opportunity for learning (blameless postmortems)
- Eliminate toil
- Measure everything - MTTR, deploy times, deploy frequency, toil
- Paying attention to your measurements is another story
- “class SRE implements DevOps”
- Reduce cost of failure (blue/green deployments and rollbacks)
- “Automate this year’s job”
- Books free until August 23
- Avoid “superheroes” - might be useful early in a small company, bad as it scales
Hacking Web Performance
- “CSS as appetizer, JavaScript as dessert” - slows load times
- Nokia 1, brand new, $140 smartphone that runs Android 9, good example of a low-end phone
- Try and fit your first load in 14.6KiB
- Can fit more in that 14K with gzip
- Can fit even more with zopfli, brotli
- Embed as much as possible (styles, js, data urls)
- Speed up HTTPS first load with HSTS
- Consider HSTS preload too
- Speed up first byte with QUIC
- New cache headers: immutable, stale-while-revalidate
- Prefetch, preconnect, and preload with <link> elements or Link: headers
- React to bad connections a la Netflix dynamic quality changes
- Use client hints, save-data, and device-memory
- Books
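A quick way to check a page against the first-load budget is to compress it and compare sizes. The sketch below uses gzip from the standard library; it takes the 14.6KiB figure from these notes as the budget and, as a simplification, ignores response headers and TLS overhead. The function names are illustrative.

```python
import gzip

# Budget taken from the figure in the notes above (14.6 KiB).
FIRST_LOAD_BUDGET_BYTES = int(14.6 * 1024)


def compressed_size(payload: bytes) -> int:
    """Size of the payload after gzip at the highest compression level."""
    return len(gzip.compress(payload, compresslevel=9))


def fits_first_load(payload: bytes) -> bool:
    """True when the gzipped payload is within the first-load budget."""
    return compressed_size(payload) <= FIRST_LOAD_BUDGET_BYTES
```

Repetitive markup compresses very well, so a page far larger than the budget on disk can still fit; already-compressed content such as images gains almost nothing.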
How we automated our test database restores
- All devs, managers, testers at FMG can restore a test database at any time
- Had problems with polluted test data, slow restores, etc
- Used a snapshot tool for SQL Server
- Built a self-service tool for people to use
Serverless in Practise
- Used Lambda, Firebase, Algolia
- API GW -> λ GraphQL -> λ -> DynamoDB -> λ -> Firebase
- λ cost $700/mo, compared to cloudfront $5100, cloudwatch $3100
- netlify, cypress, jest
- DLQ for failed λ
- Can set λ max concurrency to avoid slamming SQL DB. Can set to 0 as a kill switch
- Each function:
  - should perform 1 task
  - should perform 0 or 1 data transformations
  - should have no state and be idempotent
  - should have minimal permissions
  - should avoid recursion
- Use step functions or durable functions to manage state
- Serverless Code Patterns
- acloud.guru
- API Proxy, fanout, inline stream transformation patterns
- Serverless Architectures on AWS
- Serverless Conf [I stumbled across this by accident NY ‘16’]
- How do you avoid vendor lock-in? Not so much a problem for code, much more for the surrounding ecosystem: SQS, SNS, DynamoDB, step functions
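The "one task, one transformation, no state, idempotent" guidance for functions can be sketched as follows. This is an illustration only: the event shape and the `handle` name are hypothetical, and the `store` argument stands in for an external table such as DynamoDB so that the function itself keeps no state between invocations.

```python
from typing import Any, Dict, MutableMapping


def handle(event: Dict[str, Any],
           store: MutableMapping[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Compute an order total exactly once per event id.

    A retried or duplicated event returns the stored result instead of
    repeating the work, which is what makes the handler idempotent.
    """
    event_id = event["id"]
    if event_id in store:
        return store[event_id]
    # The single data transformation: line items -> total.
    total = sum(item["price_cents"] * item["qty"] for item in event["items"])
    result = {"id": event_id, "total_cents": total}
    store[event_id] = result
    return result
```

Because a failed invocation may be retried or replayed from a DLQ, handling the same event twice has to be safe.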
Move legacy apps to Windows Containers to take advantage of modern infrastructure and orchestration
- 3 platform options: windows server core, nano server, .net core
- No GUI
- [some tips to do silent installs]
- Install prerequisites first to avoid multi-step installs
- ServiceMonitor.exe
- [example Dockerfile]
- cmd.exe /c start is analogous to macOS open or Linux xdg-open
- Can’t join Active Directory domains inside containers yet
- Can almost run Linux containers in Windows
Data patterns catalogue: simplifying the journey from monolith to modern data architectures
- Monoliths have been working for decades
- Enterprise data warehouses have also accumulated data monoliths
- Focus on asynchronous events, immutable log streaming, and microservices
- Strangler pattern
- Collection of info available “in a few weeks”
- Event Gateway: identify event seams, publish events when a monolith changes state
- Batch to event adapter
- Event to batch adapter
- Change data capture
- Data Catalogue and Schema Registry with avro, parquet, protobuf, thrift
- Map event types to topics + streams: can validate events are going where they are supposed to, support introspection
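Two of the patterns above, the batch to event adapter and the event-type-to-topic map, are small enough to sketch in a few lines. The event types, topic names, and function names here are hypothetical, and a real system would publish to a log such as Kafka rather than return values.

```python
from typing import Any, Dict, Iterable, Iterator

# Hypothetical registry mapping event types to the topics that carry them.
TOPICS: Dict[str, str] = {
    "customer.created": "customers",
    "order.placed": "orders",
}


def batch_to_events(rows: Iterable[Dict[str, Any]],
                    event_type: str) -> Iterator[Dict[str, Any]]:
    """Batch to event adapter: turn each row of a batch extract into one event."""
    for row in rows:
        yield {"type": event_type, "data": dict(row)}


def route(event: Dict[str, Any]) -> str:
    """Return the topic for an event, rejecting unregistered event types."""
    event_type = event["type"]
    if event_type not in TOPICS:
        raise ValueError(f"no topic registered for event type {event_type!r}")
    return TOPICS[event_type]
```

Keeping the mapping in one place is what allows validation and introspection: an event with an unknown type fails loudly instead of landing on the wrong stream.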