June 8, 2021 | When half the world's internet went down

Dry Ice · 10th June 2021, 00:30

It started as just another day, but within a few hours it wasn't anymore. All our business sites were down, and a war room situation ensued. The culprit was identified and soon we were on a call with a Fastly rep.

Not too long into the call, it became clear that we weren't the only ones affected. UK gov sites were down, Financial Times was down, BBC was down, and the list kept growing.

What was happening?

Well, it seemed our (and a lot of others') CDN provider's systems had 'crashed'.

Vox has written a very nice article on this 'outage'.

https://www.vox.com/recode/2021/6/8/...-news-websites

It lasted roughly an hour, from 9:47 to 10:35 am UTC.

An interesting consequence of this outage was that call volumes to internet service providers suddenly increased, as people assumed all these sites couldn't possibly be down at the same time!

As of today, Fastly has clarified the reasons behind this outage and what they are doing to ensure it doesn't happen again, in a blog post -
https://www.fastly.com/blog/summary-of-june-8-outage

The internet, as usual, is having a field day with all sorts of memes and one-liners. But I can only imagine the stress the team must have been under while this outage was ongoing. Hope Fastly takes appropriate action to fix the processes rather than putting the blame on some individuals.

anandhsub · 10th June 2021, 00:48

Even the markets panicked, with global stocks having a bit of a "flash crash" while bonds rallied on fears this was a major global hack. They finally calmed down after a few minutes, once Fastly confirmed the issue was on their end.

I have seen fake news and rumors due to hacks move markets (the most infamous was when AP's Twitter account was hacked and someone posted about an attack on the White House), but this is the first time I saw markets panic on just a rumor that this could be a large global hack.

Jeroen · 10th June 2021, 07:24

Quote:

Originally Posted by Dry Ice

Hope Fastly takes appropriate action to fix the processes rather than putting the blame on some individuals.

I don’t think it is about blaming individuals. What is hugely worrying is that it could happen. And it really showed up who was prepared for such an eventuality and who was not.

From what I read, it was triggered by a single configuration change. Whether that change was carried out incorrectly or something else happened remains to be seen. The fact remains that one action took down their whole service.

That sounds suspiciously like a major design issue.

Jeroen

padmrajravi · 10th June 2021, 07:46

A single configuration change from one customer affecting all the other customers is the worrying factor. And the configuration change was a valid one too; it just triggered an unexpected behaviour. That means there is not enough isolation to prevent a catastrophe.

Dry Ice · 10th June 2021, 11:49

Quote:

Originally Posted by Jeroen

From what I read, it was triggered by a single configuration change. Whether that change was carried out incorrectly or something else happened remains to be seen. The fact remains that one action took down their whole service.

That sounds suspiciously like a major design issue.

Jeroen

Quote:

Originally Posted by padmrajravi

A single configuration change from one customer affecting all the other customers is the worrying factor. And the configuration change was a valid one too; it just triggered an unexpected behaviour. That means there is not enough isolation to prevent a catastrophe.

As per their initial RCA, they introduced a change on May 12 that left a code path open which could cause Varnish (the open source cache server Fastly is based on) to crash under very specific VCL flows.

It wasn't triggered by any customer until 8 June. A change by one particular customer on that fateful day caused the Varnish nodes to crash all over. They weren't able to recover from this failure.

So yeah, isolation is a big problem and a major design flaw. They have highlighted it in the blog post I linked above, quoting the relevant section here -

Quote:

Broadly, this means fully leveraging the isolation capabilities of WebAssembly and Compute@Edge to build greater resiliency from the ground up.
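
To make the isolation point concrete, here is a toy Python model (not Fastly's architecture, and not how Varnish or WebAssembly actually work). The names `WorkerCrash`, `serve_shared` and `serve_isolated` are made up for illustration, and an exception stands in for a process crash, which real code could not simply catch. It only shows the difference in blast radius between one shared worker and per-customer containment.

```python
from typing import Callable, Dict, List, Tuple

Config = Callable[[str], str]


class WorkerCrash(Exception):
    """Hypothetical stand-in for a cache worker process crashing."""


def good_config(path: str) -> str:
    # A customer config that behaves normally.
    return f"200 {path}"


def trigger_config(path: str) -> str:
    # A valid-looking customer config that happens to hit a latent bug.
    raise WorkerCrash("latent bug hit")


def serve_shared(configs: Dict[str, Config],
                 requests: List[Tuple[str, str]]) -> List[str]:
    """One worker for all customers: after a crash, everyone gets 503."""
    responses = []
    crashed = False
    for customer, path in requests:
        if crashed:
            responses.append("503")
            continue
        try:
            responses.append(configs[customer](path))
        except WorkerCrash:
            crashed = True
            responses.append("503")
    return responses


def serve_isolated(configs: Dict[str, Config],
                   requests: List[Tuple[str, str]]) -> List[str]:
    """Failure is contained per customer: only the triggering one gets 503."""
    responses = []
    crashed = set()
    for customer, path in requests:
        if customer in crashed:
            responses.append("503")
            continue
        try:
            responses.append(configs[customer](path))
        except WorkerCrash:
            crashed.add(customer)
            responses.append("503")
    return responses


configs = {"news": good_config, "shop": trigger_config}
requests = [("news", "/a"), ("shop", "/b"), ("news", "/c")]
shared_result = serve_shared(configs, requests)
isolated_result = serve_isolated(configs, requests)
```

In the shared model the "news" customer loses its second request because of what "shop" did; in the isolated model it does not.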

Fastly's customers will also be thinking about having a DR (disaster recovery) solution. A lot of the customers who use Fastly do so because of the ability to easily run some logic on the edge nodes, like implementing a paywall. So switching to a competing CDN will not really be straightforward.

Or people could go solo like Tesla and have their own in-house implementation of open source Varnish for eventualities like these.
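
As a rough idea of what the simplest part of such a DR setup could look like, here is a minimal Python sketch that picks the first healthy provider from an ordered list. The function names and the `example.com` hostnames are placeholders I made up, not anything Fastly or another CDN provides. It also ignores the hard part mentioned above: the edge logic (paywalls etc.) would still have to exist on the fallback.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, DNS/connection failures, timeouts and bad URLs.
        return False


def pick_cdn(providers: Sequence[str],
             is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first provider that passes the health check, else None."""
    for provider in providers:
        if is_healthy(provider):
            return provider
    return None


# Hypothetical endpoints, ordered by preference: primary CDN, second CDN,
# then an in-house cache as the last resort.
PROVIDERS = [
    "https://primary-cdn.example.com/health",
    "https://backup-cdn.example.com/health",
    "https://inhouse-cache.example.com/health",
]

# Simulated outage of the primary, so the sketch runs without network access.
simulated_status = {
    PROVIDERS[0]: False,
    PROVIDERS[1]: True,
    PROVIDERS[2]: True,
}
chosen = pick_cdn(PROVIDERS, lambda url: simulated_status[url])
```

In a real setup you would pass `http_probe` instead of the simulated lookup, and the result would feed a DNS or load balancer change rather than a variable.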

It sure will be an interesting few months ahead.
