Re: June 8, 2021 | When half the world's internet went down Quote:
Originally Posted by Jeroen
From what I read it was triggered by a single configuration change. Whether that change was carried out incorrectly or something else went wrong remains to be seen. The fact remains that one action took down their whole service.
That sounds suspiciously like a major design issue.
Jeroen | Quote:
Originally Posted by padmrajravi A single configuration change from one customer affecting all the other customers is the worrying factor. And the configuration change was a valid one too, just that it triggered an unexpected behaviour. That just means there is not enough isolation to prevent a catastrophe. |
As per their initial RCA, they introduced a change on May 12 which left a code path open that could cause Varnish (the open source cache server Fastly is based on) to crash under very specific VCL flows.
It wasn't triggered by any customer until 8 June. A change by one particular customer on that fateful day caused the Varnish nodes to crash all over. They weren't able to recover from this failure.
So yeah, the lack of isolation is a big problem and a major design flaw. They have highlighted it in the blog post I linked above, quoting the relevant section here - Quote:
Broadly, this means fully leveraging the isolation capabilities of WebAssembly and Compute@Edge to build greater resiliency from the ground up.
|
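To make the isolation point concrete, here is a toy sketch in Python. It is not how Fastly or Varnish actually work; `apply_config`, `LatentBugError` and the `rare_flow` setting are hypothetical names made up for illustration, and the `try`/`except` merely stands in for a real sandbox boundary such as the WebAssembly isolation mentioned in the quote.

```python
class LatentBugError(Exception):
    """Stands in for the crash caused by the dormant code path (hypothetical)."""


def apply_config(config: dict) -> str:
    # Hypothetical: a valid but unusual setting reaches the buggy code path.
    if config.get("rare_flow"):
        raise LatentBugError("unexpected code path")
    return "ok"


def serve_shared(configs: dict) -> dict:
    # No isolation: the first failure propagates and nobody gets served.
    return {name: apply_config(cfg) for name, cfg in configs.items()}


def serve_isolated(configs: dict) -> dict:
    # Isolation: a failure is contained to the customer that triggered it.
    results = {}
    for name, cfg in configs.items():
        try:
            results[name] = apply_config(cfg)
        except LatentBugError:
            results[name] = "failed"
    return results


customers = {"a": {}, "b": {"rare_flow": True}, "c": {}}
isolated = serve_isolated(customers)  # only customer "b" fails
```

With the same three customers, `serve_shared(customers)` raises and nobody is served, while `serve_isolated(customers)` keeps "a" and "c" running.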
Fastly's customers will also think about having a DR solution. A lot of the customers who use Fastly do so because of the ability to easily run some logic on the edge nodes, such as implementing a paywall. So using a competing CDN will not really be straightforward.
Or people could go solo like Tesla - have their own in-house implementation of open source Varnish for eventualities like these.
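For what the simplest form of a DR fallback could look like, here is a rough Python sketch. The hostnames and the `is_healthy` check are assumptions for illustration only, not any CDN's real API. It also does nothing about porting the edge logic itself, which is the hard part mentioned above.

```python
from typing import Callable, Optional, Sequence


def pick_cdn(candidates: Sequence[str],
             is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy host in priority order, or None if all are down."""
    for host in candidates:
        if is_healthy(host):
            return host
    return None


# Hypothetical hosts: the primary CDN first, then an in-house Varnish fallback.
hosts = ["primary-cdn.example", "inhouse-varnish.example"]

# Simulate the primary being down; a real check would probe each host.
chosen = pick_cdn(hosts, lambda host: host != "primary-cdn.example")
```

Passing the health check in as a function keeps the selection logic separate from however the probing is actually done.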
It sure will be an interesting few months ahead.
Last edited by Dry Ice : 10th June 2021 at 12:18.
Reason: Date