Scaleconf 2018 notes

by Steve Hoeksema, posted 07 Aug 2018.

Progressive Web Apps workshop

Worth considering if we want to be independent of the AWS ecosystem
Strong support on multiple providers (AWS, GCP, Azure, Catalyst)
Need to consider other related services along with compute/block/object, e.g. functions, managed SQL, NoSQL, queues
Course material

stratamap.io
Under GDPR, companies have 72 hours to notify the supervisory authority of a breach, and may also have to tell affected users. Is your tooling good enough to find affected users and what was exposed?
ArchiMate - an open successor of UML and BPMN
Useful for all kinds of mapping at multiple levels of detail - cloud infrastructure, customer journeys
Mandated by EU, NZ govt, etc
Archi tool

Detect anomalous changes in traffic volumes
Your service may be fine during an outage if the problem is at an ISP or CDN - will you know? Users don’t care where the problem lies
Measured metrics can be gamed. Post-load time should be included in performance measurements.
Metrics drive behaviour and may have perverse incentives
Hanoi rat massacre of 1902
Your metrics may miss the point - you might have great uptime but with incorrect results. The fastest page on Bing is an error message
Another Bing example: a results page that was entirely ads: great click rate, great revenue, wrong data
Another Bing example measuring click rate: if they can answer a question inline on the page (e.g. population of Wellington), no need for a click
Metrics should be high quality, regularly reviewed
Consider “dialtone availability” - you should always be able to pick up a phone and get a dialtone. You should always be able to get to the Bing home page and do a search. Other features can have lower SLAs
Combine outside-in (outside your network) metrics with anomaly detection
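
A minimal sketch of the kind of traffic-volume anomaly detection mentioned above, assuming a simple trailing-window z-score (the talk did not say which method to use). The function name, window size, and threshold are all illustrative.

```python
from statistics import mean, stdev


def find_anomalies(volumes, window=5, threshold=3.0):
    """Return indexes whose value sits more than `threshold` standard
    deviations away from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(volumes)):
        history = volumes[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change as unusual.
            if volumes[i] != mu:
                anomalies.append(i)
            continue
        if abs(volumes[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical requests-per-minute series with one sudden drop.
requests_per_minute = [100, 102, 98, 101, 99, 100, 20, 101, 100]
print(find_anomalies(requests_per_minute))
```

Feeding this with volumes measured from outside your own network is what would catch an ISP or CDN problem that internal dashboards miss.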

Contour, an ingress controller which maps Kubernetes to the Envoy proxy
Ingress controller benefits:
- traffic consolidation: combining multiple LBs into one
- TLS management
- configuration abstraction (over haproxy/nginx/apache etc)
- path-based routing
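
To make path-based routing concrete, here is a rough sketch of longest-prefix matching in plain Python. It is not how Contour or Envoy implement it; the route table and backend names are made up.

```python
def route(path, routes):
    """Pick the backend whose path prefix is the longest match for `path`."""
    best = None
    for prefix, backend in routes.items():
        # Match the prefix exactly, or as a whole leading path segment,
        # so that "/api" does not capture "/apiary".
        matches = path == prefix or path.startswith(prefix.rstrip("/") + "/")
        if matches and (best is None or len(prefix) > len(best[0])):
            best = (prefix, backend)
    return best[1] if best else None


# Hypothetical routing table: one load balancer in front of three services.
ROUTES = {
    "/": "web",
    "/api": "api-service",
    "/api/admin": "admin-service",
}

print(route("/api/users", ROUTES))
```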
Overview of Go dependency management awfulness and the workarounds
Keep docker build/push out of dev lifecycle
Try and keep integration tests to a minimum

Applying the five W’s to testing
- What are we testing?
- When would we find out about an issue?
- Where are our tests running?
- Why did something go wrong?
- Who should know about these tests?
Seemed not to be directly relevant to a CI-on-PR workflow, where the answers to all five are right there?
Built up an integration test coverage map by tracking User-Agent in web server logs against API paths
Try and reduce noise in your test failures - retries, cleaning up and consolidating errors
Ask the same questions about your users. Their test coverage differed considerably from actual user behaviour
- What are our users doing?
- When would we find out if they have a problem?
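
A sketch of how the User-Agent coverage map might be built, and how it could be compared with real user traffic. It assumes the logs have already been parsed into (path, user agent) pairs and that the test suite identifies itself with a made-up `integration-tests/` User-Agent prefix.

```python
from collections import Counter

# Hypothetical User-Agent prefix sent by the integration test suite.
TEST_AGENT_PREFIX = "integration-tests/"


def coverage_map(log_entries):
    """Split hit counts per API path into test traffic and user traffic."""
    tested, used = Counter(), Counter()
    for path, user_agent in log_entries:
        if user_agent.startswith(TEST_AGENT_PREFIX):
            tested[path] += 1
        else:
            used[path] += 1
    return tested, used


def untested_user_paths(log_entries):
    """Paths that real users hit but the integration tests never touch."""
    tested, used = coverage_map(log_entries)
    return sorted(path for path in used if path not in tested)


# Made-up log sample.
LOG = [
    ("/api/login", "integration-tests/1.0"),
    ("/api/login", "Mozilla/5.0"),
    ("/api/search", "Mozilla/5.0"),
    ("/api/export", "curl/7.58.0"),
]
print(untested_user_paths(LOG))
```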

Fundamental conflict between agility and stability, traditional dev vs ops/sysadmin
Devs typically “won” as they were closest to business needs
ca. 2009, resolved by putting dev+ops in the same room, talking
ca. 2018, also means breaking down barriers in security, marketing, etc
Accept partial failure as normal, necessary for rapid progress
Use outages as an opportunity for learning (blameless postmortems)
Eliminate toil
Measure everything - MTTR, deploy times, deploy frequency, toil
Paying attention to your measurements is another story
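
As a rough illustration of two of those measurements, MTTR and deploy frequency can be computed from timestamps you probably already have. The function names and sample data below are invented, and both functions assume non-empty input.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average outage duration from (started, resolved) datetime pairs."""
    durations = [resolved - started for started, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def deploys_per_week(deploy_times):
    """Deploy count divided by the observed span, floored at one week."""
    span = max(deploy_times) - min(deploy_times)
    weeks = max(span / timedelta(weeks=1), 1.0)
    return len(deploy_times) / weeks


# Made-up incident and deploy history.
INCIDENTS = [
    (datetime(2018, 8, 1, 10, 0), datetime(2018, 8, 1, 10, 30)),
    (datetime(2018, 8, 3, 14, 0), datetime(2018, 8, 3, 15, 30)),
]
DEPLOYS = [datetime(2018, 8, day) for day in (1, 5, 9, 15)]

print(mean_time_to_recovery(INCIDENTS), deploys_per_week(DEPLOYS))
```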
“class SRE implements DevOps”
Reduce cost of failure (blue/green deployments and rollbacks)
“Automate this year’s job”
Books free until August 23
Avoid “superheroes” - might be useful early in a small company, bad as it scales

“CSS as appetizer, JavaScript as dessert” - slows load times
Nokia 1, brand new, $140 smartphone that runs Android 9, good example of a low-end phone
Try and fit your first load in 14.6 kB (10 TCP segments of 1460 bytes)
Can fit more in that 14K with gzip
Can fit even more with zopfli, brotli
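
A quick way to check a page against that first-load budget, sketched with the standard library's gzip only (zopfli and brotli need third-party packages). It assumes an initial congestion window of 10 segments, as in RFC 6928, and a typical 1460-byte segment payload; the `header_bytes` allowance is a guess.

```python
import gzip

MSS_BYTES = 1460       # typical TCP payload per segment (assumption)
INITIAL_WINDOW = 10    # initial congestion window in segments (RFC 6928)
FIRST_FLIGHT_BYTES = MSS_BYTES * INITIAL_WINDOW  # 14600 bytes


def fits_first_flight(body: bytes, header_bytes: int = 500) -> bool:
    """True if the gzipped body plus a rough allowance for response
    headers fits in the first round trip's worth of TCP segments."""
    compressed = gzip.compress(body, compresslevel=9)
    return len(compressed) + header_bytes <= FIRST_FLIGHT_BYTES


# Repetitive markup compresses well, so 60 kB of it can still fit.
page = b"<p>hello</p>" * 5000
print(len(page), fits_first_flight(page))
```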
Embed as much as possible (styles, js, data urls)
Speed up HTTPS first load with HSTS
Consider HSTS preload too
Speed up first byte with QUIC
New cache headers: immutable, stale-while-revalidate
Prefetch, preconnect, and preload with <link> elements or Link: headers
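
For reference, a sketch of what those response headers could look like when built by hand. The max-age values, URLs, and function names are placeholders, not recommendations from the talk.

```python
def static_asset_headers(max_age=31536000):
    """Headers for fingerprinted assets that never change at a given URL."""
    return {"Cache-Control": f"public, max-age={max_age}, immutable"}


def html_headers(preload=(), preconnect=()):
    """Headers for an HTML page: serve slightly stale copies while
    revalidating, and hint at resources the page will need."""
    links = [f"<{url}>; rel=preload; as={kind}" for url, kind in preload]
    links += [f"<{origin}>; rel=preconnect" for origin in preconnect]
    headers = {"Cache-Control": "max-age=60, stale-while-revalidate=600"}
    if links:
        headers["Link"] = ", ".join(links)
    return headers


# Hypothetical page that needs one stylesheet and one third-party origin.
print(html_headers(
    preload=[("/css/site.css", "style")],
    preconnect=["https://cdn.example.com"],
))
```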
React to bad connections a la Netflix dynamic quality changes
Use client hints, save-data, and device-memory
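
One possible server-side use of those hints, as a sketch. It assumes request headers arrive as a dict with canonical names, and the "lite"/"full" variants and the 1 GiB cutoff are invented for illustration.

```python
def choose_variant(headers):
    """Serve a lighter page when the client asks to save data or
    reports a low-memory device."""
    save_data = headers.get("Save-Data", "").lower() == "on"
    try:
        memory_gib = float(headers.get("Device-Memory", ""))
    except ValueError:
        # Header missing or malformed: no memory information.
        memory_gib = None
    if save_data or (memory_gib is not None and memory_gib <= 1):
        return "lite"
    return "full"


print(choose_variant({"Device-Memory": "0.5"}))
```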
Books

Used Lambda, Firebase, Algolia
API GW -> λ GraphQL -> λ -> DynamoDB -> λ -> Firebase
λ cost $700/mo, compared to CloudFront $5100, CloudWatch $3100
netlify, cypress, jest
DLQ for failed λ
Can set λ max concurrency to avoid slamming SQL DB. Can set to 0 as a kill switch
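
A sketch of the concurrency cap and kill switch. It assumes `lambda_client` is a boto3 Lambda client (for example from `boto3.client("lambda")`), whose `put_function_concurrency` call sets reserved concurrency; the wrapper names and the function name in the usage comment are made up. boto3 is not imported here, so the block itself runs without AWS credentials.

```python
def set_concurrency(lambda_client, function_name, limit):
    """Cap how many copies of a function can run at once."""
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )


def kill_switch(lambda_client, function_name):
    """Reserved concurrency of 0 stops all new invocations."""
    return set_concurrency(lambda_client, function_name, 0)


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   client = boto3.client("lambda")
#   set_concurrency(client, "orders-writer", 10)  # protect the SQL DB
#   kill_switch(client, "orders-writer")          # emergency stop
```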
Each function:
- should perform 1 task
- should perform 0 or 1 data transformations
- should have no state and be idempotent
- should have minimal permissions
- should avoid recursion
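
A toy handler that follows those rules, as a sketch only. The set and list stand in for external stores (say, a DynamoDB table and a downstream queue), so the function itself keeps no state; the event shape with `id` and `payload` keys is an assumption.

```python
def make_handler(processed_ids, sink):
    """Build a handler that does one task with one transformation and
    is safe to run twice for the same event."""
    def handler(event):
        event_id = event["id"]
        if event_id in processed_ids:
            # Redelivery: do nothing, so a retry cannot double-write.
            return "duplicate"
        sink.append(event["payload"].upper())  # the single transformation
        processed_ids.add(event_id)
        return "processed"
    return handler


seen, output = set(), []
handle = make_handler(seen, output)
print(handle({"id": "evt-1", "payload": "hello"}))
print(handle({"id": "evt-1", "payload": "hello"}))
```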
Use step functions or durable functions to manage state
Serverless Code Patterns
acloud.guru
API Proxy, fanout, inline stream transformation patterns
Serverless Architectures on AWS
Serverless Conf NY ’16 [I stumbled across this by accident]
How do you avoid vendor lock-in? Not so much a problem for code, much more for surrounding ecosystem: SQS, SNS, DynamoDB, step functions

Monoliths have been working for decades
Enterprise data warehouses have also accumulated data monoliths
Focus on asynchronous events, immutable log streaming, and microservices
Strangler pattern
Collection of info available “in a few weeks”
Event Gateway: identify event seams, publish events when a monolith changes state
Batch to event adapter
Event to batch adapter
Change data capture
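
One simple way to get events out of a batch-oriented monolith is to diff successive snapshots, sketched below. This is snapshot diffing rather than log-based change data capture, and the event shape is invented.

```python
def diff_to_events(before, after):
    """Compare two snapshots (dicts keyed by record id) and emit one
    event per created, updated, or deleted record."""
    events = []
    for key in sorted(before.keys() | after.keys()):
        if key not in before:
            events.append({"type": "created", "key": key, "data": after[key]})
        elif key not in after:
            events.append({"type": "deleted", "key": key, "data": before[key]})
        elif before[key] != after[key]:
            events.append({"type": "updated", "key": key, "data": after[key]})
    return events


# Made-up nightly batch snapshots of an orders table.
yesterday = {1: {"status": "open"}, 2: {"status": "open"}}
today = {2: {"status": "shipped"}, 3: {"status": "open"}}
for event in diff_to_events(yesterday, today):
    print(event)
```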
Data Catalogue and Schema Registry with avro, parquet, protobuf, thrift
Map event types to topics + streams: can validate events are going where they are supposed to, support introspection
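
A sketch of that validation idea: keep the event type to topic mapping as data and check observed traffic against it. The event types and topic names are hypothetical.

```python
# Hypothetical registry mapping each event type to its topic.
TOPIC_FOR_EVENT = {
    "order.created": "orders",
    "order.shipped": "orders",
    "user.signed_up": "users",
}


def misrouted(observed):
    """Given (event_type, topic) pairs seen on the wire, report events
    that are unregistered or landed on the wrong topic."""
    problems = []
    for event_type, topic in observed:
        expected = TOPIC_FOR_EVENT.get(event_type)
        if expected is None:
            problems.append((event_type, topic, "unregistered event type"))
        elif expected != topic:
            problems.append((event_type, topic, f"expected topic {expected}"))
    return problems


print(misrouted([("order.created", "orders"), ("order.shipped", "users")]))
```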

This post licensed as CC BY-NC-SA.