Old 10th June 2021, 00:30   #1
Distinguished - BHPian
 
Join Date: Sep 2008
Location: --
Posts: 3,552
Thanked: 7,262 Times
June 8, 2021 | When half the world's internet went down

It started as just another day, but within a few hours it was anything but. All our business sites were down, and a war-room situation ensued. The culprit was identified, and soon we were on a call with a Fastly rep.

Not long into the call, it became clear that we weren't the only ones affected. UK government sites were down, the Financial Times was down, the BBC was down, and the list kept growing.

What was happening?

Well, it seemed our (and a lot of others') CDN provider's systems had 'crashed'.

Vox has written a very nice article on this 'outage'.

https://www.vox.com/recode/2021/6/8/...-news-websites

It lasted roughly an hour, from 9:47 to 10:35 am UTC.

An interesting consequence of this outage was a sudden spike in call volumes to internet service providers, as people figured all these sites couldn't be down at the same time!

As of today, Fastly has explained the reasons behind this outage, and what they are doing to ensure it doesn't happen again, in a blog post:
https://www.fastly.com/blog/summary-of-june-8-outage

The internet, as usual, is having a field day with all sorts of memes and one-liners. But I can only imagine the stress the team must have been under while this outage was ongoing. Hope Fastly takes appropriate action to fix its processes rather than putting the blame on individuals.
Dry Ice is offline   (9) Thanks
Old 10th June 2021, 00:48   #2
BHPian
 
Join Date: Oct 2019
Location: Bangalore
Posts: 498
Thanked: 1,286 Times
re: June 8, 2021 | When half the world's internet went down

Even the markets panicked, with global stocks having a bit of a "flash crash" while bonds rallied on fears that this was a major global hack. Things calmed down a few minutes later, once Fastly confirmed the issue was on their end.

I have seen fake news and rumors from hacks move markets before (the most infamous being when AP's Twitter account was hacked and someone posted about an attack on the White House), but this is the first time I have seen the market panic on just a rumor that this could be a large global hack.
anandhsub is offline   (2) Thanks
Old 10th June 2021, 07:24   #3
Distinguished - BHPian
 
Join Date: Oct 2012
Location: Delhi
Posts: 8,104
Thanked: 50,904 Times
re: June 8, 2021 | When half the world's internet went down

Quote:
Originally Posted by Dry Ice
Hope Fastly takes appropriate action to fix its processes rather than putting the blame on individuals.
I don’t think it is about blaming individuals. What is hugely worrying is that it could happen. And it really showed up who was prepared for such an eventuality and who was not.

From what I read, it was triggered by a single configuration change. Whether that change was carried out incorrectly or something else went wrong remains to be seen. The fact remains that one action took down their whole service.

That sounds suspiciously like a major design issue.

Jeroen
Jeroen is online now   (1) Thanks
Old 10th June 2021, 07:46   #4
Senior - BHPian
 
Join Date: May 2019
Location: Kozhikode
Posts: 1,229
Thanked: 5,517 Times
re: June 8, 2021 | When half the world's internet went down

A single configuration change from one customer affecting all the other customers is the worrying factor. And the configuration change was a valid one too; it just triggered unexpected behaviour. That means there is not enough isolation to prevent a catastrophe.
padmrajravi is offline   (2) Thanks
Old 10th June 2021, 11:49   #5
Distinguished - BHPian
 
Join Date: Sep 2008
Location: --
Posts: 3,552
Thanked: 7,262 Times
Re: June 8, 2021 | When half the world's internet went down

Quote:
Originally Posted by Jeroen

From what I read, it was triggered by a single configuration change. Whether that change was carried out incorrectly or something else went wrong remains to be seen. The fact remains that one action took down their whole service.

That sounds suspiciously like a major design issue.

Jeroen
Quote:
Originally Posted by padmrajravi
A single configuration change from one customer affecting all the other customers is the worrying factor. And the configuration change was a valid one too; it just triggered unexpected behaviour. That means there is not enough isolation to prevent a catastrophe.
As per their initial RCA, a change they introduced on May 12 left open a code path that could cause Varnish (the open-source cache server Fastly is based on) to crash under very specific VCL flows.

It wasn't triggered by any customer until 8 June. A change by one particular customer on that fateful day caused the Varnish nodes to crash all over. They weren't able to recover from this failure.

So yeah, lack of isolation is a big problem and a major design flaw. They have highlighted it in the blog post I linked above; quoting the relevant section here:

Quote:
Broadly, this means fully leveraging the isolation capabilities of WebAssembly and Compute@Edge to build greater resiliency from the ground up.
Fastly's customers will also be thinking about a DR (disaster recovery) solution. Many of them use Fastly because of the ability to easily run logic on the edge nodes, such as implementing a paywall, so switching to a competing CDN will not be straightforward.

Or people could go solo like Tesla and have their own in-house implementation of open-source Varnish for eventualities like these.
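
To make the DR idea a bit more concrete, here is a minimal Python sketch of a failover picker: try the primary CDN first and fall back to the next option when its health check fails. The provider names and the health-check callable are hypothetical placeholders, not anything Fastly or its customers actually run, and a real setup would more likely do this at the DNS or load-balancer layer than in application code.

```python
from typing import Callable, Iterable, Optional


def pick_healthy_provider(
    providers: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first provider whose health check passes, or None.

    `providers` is ordered by preference (primary first).
    `is_healthy` is a caller-supplied check (hypothetical here); in practice
    it might probe a status URL with a short timeout.
    """
    for name in providers:
        try:
            if is_healthy(name):
                return name
        except Exception:
            # A health check that blows up is treated the same as "unhealthy".
            continue
    return None


if __name__ == "__main__":
    # Hypothetical names, ordered by preference.
    providers = ["primary-cdn", "backup-cdn", "in-house-varnish"]

    # Simulate the primary being down.
    def fake_check(name: str) -> bool:
        return name != "primary-cdn"

    print(pick_healthy_provider(providers, fake_check))
```

The catch is the one mentioned above: the fallback only helps if it can serve the same edge logic (paywall etc.), otherwise it has to be a degraded mode.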

It sure will be an interesting few months ahead.

Last edited by Dry Ice : 10th June 2021 at 12:18. Reason: Date
Dry Ice is offline   (2) Thanks