
What Can We Learn from the Largest Global IT Incident Ever?

On July 19, 2024, a CrowdStrike software update unleashed mayhem on computer systems at airports, banks and more from Australia to Atlanta. What happened, and what lessons can we take away?

Image: The CrowdStrike logo on a smartphone in front of a blue computer screen reading, "Your device ran into a problem and needs to restart." (Adobe Stock/Robert)

“It was a hacker who caused this mess.”

Those were the words of the kind American Airlines (AA) agent who rebooked my wife and daughter onto new flights after their original flights were canceled at Detroit Metro Airport on Friday, July 19, 2024. After AA’s phone line said there was an eight-hour wait to speak with someone, we rushed to the airport. When we arrived, we waited almost an hour to get to this interaction.

I responded to the agent, "Really?"

“Yep, it was a cyber attack,” she said. “Hard to believe that someone can cause this much havoc in 2024.”

A few minutes later, as we walked away from the ticket counter with new boarding passes and our bags checked, my wife, Priscilla, gave me a puzzled glance and whispered, “You said it was a CrowdStrike software update mistake. Why didn’t you correct her?”

“Too many people in line waiting,” I responded. “Besides, I didn’t want to get into an argument. She was very helpful and under a ton of stress.”

My wife and daughter were fortunate. Their new flights arrived at their destination only three hours late on the same day, while thousands of others on other airlines took several days to get flights out or get home.

Indeed, as I write this blog on Friday, July 26, a colleague just reported that flight delays stemming from this incident are still occurring.

And no, a malicious hacker did not cause the incident. According to CrowdStrike, “the outage was caused by a defect found in a Falcon content update for Windows hosts.”

This defect resulted in the “blue screen of death” that everyone has been talking about for the past week.

DETAILS, PLEASE


To grasp the enormity of this problem globally, consider these media headlines and excerpts:

CNN, “We finally know what caused the global tech outage - and how much it cost”: “All told, the outage may have cost Fortune 500 companies as much as $5.4 billion in revenues and gross profit, Parametrix said, not counting any secondary losses that may be attributed to lost productivity or reputational damage. Only a small portion, around 10 percent to 20 percent, may be covered by cybersecurity insurance policies, Parametrix added.

"Fitch Ratings, one of the largest U.S. credit ratings agencies, said Monday that the types of insurance likely to see the most claims stemming from the outage include business interruption insurance, travel insurance and event cancellation insurance.”

WDAM, “Delta under investigation for its response to CrowdStrike tech outage-related cancellations, DOT announces”: “An investigation has opened into Delta’s response to the CrowdStrike outage after the airline continued to have a high number of flight cancellations even after other airlines had returned to normal.

“Transportation Secretary Pete Buttigieg announced the investigation on social media on Tuesday morning, saying the agency is making sure the airline is abiding by the law and 'taking care of its passengers during continued widespread disruptions.'”

TechRadar, “Microsoft blames EU rules for its inability to lock down Windows following CrowdStrike incident”: “Microsoft is reportedly analyzing whether restrictions enforced by the European Commission could be partly responsible for amplifying issues with Windows systems during the recent CrowdStrike outage incident.

“The Wall Street Journal (WSJ) notes that in an intriguing point concerning the security of Windows operating systems, Microsoft’s spokesperson pointed out a 2009 agreement with the Commission prevented the company from enhancing the OS's security more rigorously.

“The agreement came in response to a complaint, and required Microsoft to offer security software developers the same level of access to Windows as the company itself has.”

BBC, “CrowdStrike backlash over $10 apology voucher”: “CrowdStrike is facing fresh backlash after giving staff and firms they work with a $10 UberEats voucher to say sorry for a global IT outage that caused chaos across airlines, banks and hospitals last week.”

LESSONS LEARNED


While these are still early days and efforts to recover and restore some systems are ongoing, here are some initial perspectives on what we can learn from this situation:

Jen Easterly (CISA director) on LinkedIn, “Ode to an Outage”: “To channel my alter ego Bob Lord: 'We don’t have a cybersecurity problem, we have a software quality problem.'

“Now before you start throwing flaming poo at me, yes, I further recognize the irony of a cybersecurity vendor creating a defective update that temporarily crippled systems made by the world’s biggest software company. And to be clear, this was not a Microsoft issue. As I said at the top, we don’t yet fully know what happened or why, but one thing I do know is that any company that builds any kind of software should design, test, and deliver it with a priority on dramatically driving down the number of flaws — flaws which can be intentionally exploited by bad actors or flaws that can unintentionally take down critical services across the globe. The other thing I know is that anyone who consumes tech (yup — that's basically all of us) should demand that those technology and software manufacturers do exactly that, which is why we’ve been working with technology companies large and small, including CrowdStrike and Microsoft, to voluntarily commit to the Secure by Design pledge.”

CNBC, “CrowdStrike update that caused global outage likely skipped checks, experts say”: “Security experts said CrowdStrike’s routine update of its widely used cybersecurity software, which caused clients’ computer systems to crash globally on Friday, apparently did not undergo adequate quality checks before it was deployed.

“The latest version of its Falcon Sensor software was meant to make CrowdStrike clients’ systems more secure against hacking by updating the threats it defends against. But faulty code in the update files resulted in one of the most widespread tech outages in recent years for companies using Microsoft’s Windows operating system.

“Global banks, airlines, hospitals and government offices were disrupted. CrowdStrike released information to fix affected systems, but experts said getting them back online would take time as it required manually weeding out the flawed code.”
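
For readers wondering what “manually weeding out the flawed code” looked like in practice, the widely circulated workaround was to boot each affected Windows host into Safe Mode or the Windows Recovery Environment and delete the faulty Falcon channel file from the CrowdStrike driver directory. The sketch below is only an illustration of that manual step, not official CrowdStrike tooling; the directory and the C-00000291*.sys file name pattern follow the public guidance reported at the time, and anything like this should only ever be run according to the vendor’s current instructions.

```python
# Illustrative sketch of the manual cleanup step reported in public guidance:
# boot the host into Safe Mode, then remove the faulty Falcon channel file(s).
# Not official CrowdStrike tooling; follow the vendor's current instructions.
import glob
import os

# Directory and file pattern as reported in the published workaround.
CROWDSTRIKE_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
FAULTY_PATTERN = "C-00000291*.sys"

def remove_faulty_channel_files(dry_run: bool = True) -> list[str]:
    """Find (and optionally delete) the faulty channel files; return their paths."""
    matches = glob.glob(os.path.join(CROWDSTRIKE_DIR, FAULTY_PATTERN))
    for path in matches:
        print(("Would remove: " if dry_run else "Removing: ") + path)
        if not dry_run:
            os.remove(path)
    return matches

if __name__ == "__main__":
    # Dry run by default; only disable it on a host booted into Safe Mode.
    remove_faulty_channel_files(dry_run=True)
```

The hard part, of course, was not the script but the logistics: the step had to be repeated host by host, often at a physical keyboard, which is why recovery stretched across days for some organizations.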

Wall Street Journal, “CrowdStrike’s Botched Tech Update Wasn’t Unique. Are Lessons Ever Learned? Critical infrastructure is at the mercy of tech vendors doing everything right”: “Friday’s global tech outage shows how fragile and interconnected infrastructure can be, proving once again companies are vulnerable to their trusted vendors.

“A defective software update from cybersecurity company CrowdStrike roiled businesses across the world. It wasn’t a cyberattack, but the effects of a worst-case attack quickly emerged: Airlines grounded flights, hospitals canceled procedures and key systems for everyday life were disrupted from Sydney to San Francisco.”

LOOKING DEEPER


The big question swirling around the world is simply: Who will pay for all this incident cleanup? My friend Michael McLaughlin has this to say on LinkedIn:

“Does a global cyber outage qualify as a 'material cybersecurity incident'? This is the question hundreds of companies are grappling with this week. Under the SEC cyber rule, public companies are required to promptly disclose material cybersecurity incidents under Item 1.05 of Form 8-K. If the company is unsure whether the incident is material, the SEC released guidance that those incidents should be reported under Item 8.01. … But what is a 'material cybersecurity incident'? What does this mean for CrowdStrike's public customers impacted by this event? Other companies should consider a range of factors when assessing whether this incident materially impacted them, such as:

- Reputational harm
- Remediation costs
- Legal risks
- Lost revenues
- Insurance

“Importantly, these should also be placed in the context of a global cyber outage — e.g., what is the reputational damage to a single company amongst thousands impacted?”

Another interesting view comes from Tim Wessels, who wrote:

“This will be unique to each company. Yes, Microsoft authorized (WHQL) the use of the Falcon kernel mode driver, but Microsoft does not authorize the Falcon update file or pseudo-code that is likely delivered multiple times a day and run by the Falcon kernel mode driver. The Falcon kernel mode driver 'choked' on an update file filled with zeros and borked the Windows kernel, which caused the Windows kernel to 'blue screen' to save itself from any additional damage.

“Was CrowdStrike just lucky that this never happened before? It should have employed checks before running the update. One was when the update was ready to be pushed out from CrowdStrike, and the Falcon kernel mode driver should have performed a second check before running the update. This strikes me as negligent behavior by CrowdStrike to assume that its Falcon update file would always arrive in the correct format. I think CrowdStrike will likely find itself on the receiving end of a class-action lawsuit for negligence, resulting in huge customer damage.”
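
Wessels’ point about validating content before it ever reaches a kernel driver is easy to illustrate. The sketch below is a generic, hypothetical publisher-side gate, not CrowdStrike’s actual pipeline, and the function and parameter names are mine: it refuses to ship (or load) a content file that is empty, all zeros, or whose checksum does not match what the publisher expects, which is the kind of sanity check he argues should have run both before the push and again in the agent before the file was parsed.

```python
# Hypothetical sanity gate for a content update, sketching the kind of checks
# Wessels describes; this is not CrowdStrike's actual pipeline.
import hashlib
from pathlib import Path

def validate_content_update(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the update file is empty, all zeros, or corrupted."""
    data = path.read_bytes()
    if len(data) == 0:
        raise ValueError(f"{path}: update file is empty")
    if data.count(0) == len(data):
        raise ValueError(f"{path}: update file is all zero bytes")
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"{path}: checksum mismatch (got {digest})")

# The same check could run a second time on the endpoint, before the file is
# handed to the kernel-mode driver, so a malformed file is quarantined rather
# than parsed in a context where a crash takes down the whole machine.
```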

Dave DeWalt, founder and CEO of NightDragon, wrote this on LinkedIn about the situation:

“It takes a village in cybersecurity. As we grapple with this CrowdStrike outage today, sending out my gratitude to the hundreds of thousands of cyber teams, engineering teams and government leaders who are pouring their hearts and souls into getting all of us up and running again. It’s a Back to the Future moment in some ways for me — I’ve had a lot of people texting and calling to ask if today’s situation is bringing me back to a similar incident that happened when I was CEO of McAfee back in 2010, where more than 1,500 companies were crippled within just seconds because of a bad update (video here from back in the day: https://lnkd.in/gGDxju5t). At McAfee, this was one of my worst days as CEO, but it was also one of our best. True leaders are tested in these moments, and we’re seeing these leaders step up around us today: CISOs and CIOs, their incredible teams, CrowdStrike’s teams working to get a fix together within hours, CISA and Jen Easterly coordinating on the government front, and many, many more.

"George and his team have done an incredible job, working through the night in difficult circumstances to deliver a fix. It is a huge credit to the CrowdStrike team and their leadership that many woke up to a fix already available. I’ve also been honored to help myself through the night with government and private sector response, as well as watch firsthand the work of CISOs, CIOs and their teams working tirelessly to get these fixes implemented, manually updating servers and getting our flights, and hospitals, and systems all running again. We owe all of these teams a debt of gratitude."

MY PERSPECTIVE


I can relate to Dave DeWalt's perspective on this incident. When I was CTO in Michigan, the state's biggest outage ever happened back in 2010 as a result of human error while setting up new backup and recovery systems. A critical mistake by one staff member brought down a large section of the state’s infrastructure, including email.

Bottom line: Mistakes will be made, and what matters is how you recover from them. No doubt, CrowdStrike should have done a better job testing this update, but mistakes like this will happen again.

Perhaps a more important question is why this scenario was not adequately understood and tested on Microsoft's operating system. This incident should underline the importance of not just testing but scenario planning.

How far should your teams go in tabletops and real exercises? More on this topic in this recent blog.
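
One concrete way to turn scenario planning into day-to-day engineering practice is a staged rollout: push an update to a small canary ring first, watch crash and health telemetry, and only then widen the blast radius. The sketch below is a minimal, hypothetical illustration of that idea; the ring sizes, function names and health check are my own placeholders, not any vendor’s deployment system.

```python
# Minimal, hypothetical staged-rollout gate: deploy ring by ring and stop
# the rollout if any ring's hosts report as unhealthy.
from typing import Callable

ROLLOUT_RINGS = [0.01, 0.10, 0.50, 1.00]  # fraction of the fleet per stage

def staged_rollout(hosts: list[str],
                   deploy: Callable[[str], None],
                   healthy: Callable[[str], bool]) -> bool:
    """Deploy to each ring in turn; abort if any deployed host looks unhealthy."""
    done = 0
    for fraction in ROLLOUT_RINGS:
        target = max(done, int(len(hosts) * fraction))
        for host in hosts[done:target]:
            deploy(host)
        if not all(healthy(host) for host in hosts[:target]):
            print(f"Rollout halted at {fraction:.0%} of the fleet")
            return False
        done = target
    return True
```

Even a crude gate like this can turn a bad update from a global outage into an incident affecting a small slice of a fleet, which is a very different news story.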

FINAL THOUGHTS


One more thing. Watch this brief video for a few more takeaways:

I like the lessons learned described by Jonathan Edwards, including the prevalence of fake news and misinformation surrounding incidents. Of course, this brings us back to my story from the airport and the messages being spread about this incident.

Also, beware of “ambulance chasers”: salespeople claiming this could never happen with their technology and/or security solutions.

Finally, Edwards covers the importance of asking “what if,” along with testing and scenario planning.

Daniel J. Lohrmann is an internationally recognized cybersecurity leader, technologist, keynote speaker and author.