Recovering from CrowdStrike, Prepping for the Next Incident

A bad update can bring down entire operations. Here’s how governments are returning to business as usual after the landmark CrowdStrike outage — and how to prepare for the next such incident.

July 30, 2024

The global CrowdStrike fiasco has been massively disruptive, but this wasn’t the first time a bad update had far-reaching effects — and it “won’t be the last time we see an IT solution impacted this widely,” said TJ Sayers, director of intelligence and incident response at the Center for Internet Security (CIS).

The bad CrowdStrike update was due to “an engineering and configuration problem” — not malicious activity — but threat actors are likely taking note of the effects and considering what other vendors or services they could hit to reach worldwide user bases, Sayers said.

As such, organizations working to recover also need to be thinking of how to prepare against the next such incident.

Recovery itself has been a significant undertaking.

Philadelphia returned to normal operations in the middle of last week, after restoring more than 6,000 computer systems, the city said. As of July 24, only 1 percent of city devices had not been restored, likely those “that may be with individual employees and not yet seen by IT staff.”

Meanwhile, New York City faced roughly 300,000 affected machines, according to published comments from Chief Technology Officer Matt Fraser. As of July 23, the city had 40,000 left to fix. Recovery initially required staff to work on each of the thousands of downed machines individually.

“... part of the steps to remediate this required us to touch physically or, in some cases, the virtual machines themselves and go through a process,” Fraser said July 23 during a briefing from Mayor Eric Adams.

Later in its response to the incident, CrowdStrike developed a method to help organizations that had the affected product deployed in a government or commercial cloud environment, Sayers said. Under this approach, staff manually reboot a downed device; then, if all goes well, the product automatically identifies the flawed update and quarantines or removes it before it can crash the system. In his comments, Fraser said the city would first test this automated fix in a lab environment to identify any new issues, rather than “use it at face value.” Speaking on Monday, Sayers said the solution has been effective for many state, local, tribal and territorial governments.

Still, the question isn’t just short-term recovery — it’s also how organizations can plan to better withstand the next such event.

Carefully deciding when to accept software updates could be important. Real-time cybersecurity updates are intended to keep pace with rapidly evolving cyber threats. But quickly accepting a vendor’s update without first testing it risks taking a system down.

Sayers suggested organizations discuss with their vendors how quickly to update different kinds of systems, based on their particular operations and the systems’ criticality and likelihood of being targeted. For example, an organization could decide to let non-critical systems update several times an hour, while essential systems that cannot tolerate downtime would only get updates after internal IT teams reviewed and tested them. That aligns with previous public remarks from Fraser, who said New York City only lets automatic updates reach critical systems like 911 during certain time periods, and tests the updates in a sandbox first. Meanwhile, other systems like basic workstations, and those connected to the Internet, update in real time “because the risks to those machines are much greater.”
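The tiering Sayers and Fraser describe can be sketched in a few lines of Python. This is a minimal illustration, not any city's or vendor's actual configuration: the `System` fields, the `UpdateMode` names and the `choose_update_mode` rule are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class UpdateMode(Enum):
    # Hypothetical labels for the two approaches described in the article.
    REAL_TIME = "real_time"                # accept vendor updates as published
    TESTED_IN_WINDOW = "tested_in_window"  # sandbox-test first, deploy in a set window


@dataclass(frozen=True)
class System:
    name: str
    critical: bool  # True if the system cannot tolerate downtime (e.g. 911)


def choose_update_mode(system: System) -> UpdateMode:
    """Assumed rule: critical systems wait for internal testing; others update live."""
    if system.critical:
        return UpdateMode.TESTED_IN_WINDOW
    return UpdateMode.REAL_TIME


if __name__ == "__main__":
    for s in (System("911 dispatch", critical=True),
              System("office workstation", critical=False)):
        print(f"{s.name}: {choose_update_mode(s).value}")
```

A real policy would likely weigh more factors, such as how likely a system is to be targeted, but the point is that the decision is made per system class rather than once for the whole environment.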

CrowdStrike itself is now making changes to how it approaches updates. In a preliminary post-incident review, the company said an error in the tool it uses to validate content configuration updates before publishing them had allowed the faulty update to go through. The company now plans to use more types of testing, to add more validation checks to its Content Validator tool, and to give customers more control over when and where content updates are deployed. It will also start introducing updates to a small user base at first before gradually rolling them out to all customers. That is expected to make problems easier to catch before they affect everyone.
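A gradual rollout of this kind can be illustrated with a short Python sketch. It is not CrowdStrike's process; the ring sizes, the failure threshold and the `deploy` callback are assumptions chosen only to show the idea of stopping early when a small first group reports problems.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],            # hypothetical: returns True if the host stays healthy
    ring_fractions: Sequence[float] = (0.01, 0.10, 1.0),  # assumed cumulative ring sizes
    max_failure_rate: float = 0.02,           # assumed halt threshold per ring
) -> List[str]:
    """Deploy ring by ring and return the hosts that received the update.

    The rollout halts after the first ring whose failure rate exceeds
    max_failure_rate, so later rings never receive the update.
    """
    updated: List[str] = []
    start = 0
    for fraction in ring_fractions:
        # Each ring extends the cumulative coverage, by at least one host.
        end = min(len(hosts), max(start + 1, int(len(hosts) * fraction)))
        ring = hosts[start:end]
        if not ring:
            break
        failures = sum(1 for host in ring if not deploy(host))
        updated.extend(ring)
        if failures / len(ring) > max_failure_rate:
            break  # stop before the faulty update reaches everyone
        start = end
    return updated
```

With 100 hosts and a `deploy` function that always fails, only the single host in the first ring is touched; with a healthy update, all 100 are reached in three steps.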

In the future, organizations could consider whether outside factors make a potential software acquisition riskier, Sayers said. A product widely used by Fortune 100 companies, for example, has the added risk of being an attractive target to attackers hoping to hit many such victims in a single attack.

“There is a soft underbelly in the global IT world, where you can have instances where a particular piece of software or a particular vendor is so heavily relied upon that they themselves could potentially become a target in the future,” Sayers said.

Organizations also need to identify any single points of failure in their environments — instances where they rely on an IT solution whose disruption, whether deliberate or accidental, could disrupt their whole organization. When one is identified, they need to begin planning around the risks and looking for backup processes.
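One simple way to start that inventory is to list which IT solutions each business function depends on and look for overlap. The Python sketch below assumes a hand-built mapping; the function names and product names in the example are made up, and "single point of failure" is narrowed here to mean a solution that every listed function relies on.

```python
from typing import Dict, Set


def single_points_of_failure(dependencies: Dict[str, Set[str]]) -> Set[str]:
    """Return the solutions that every listed business function depends on."""
    if not dependencies:
        return set()
    # A solution shared by all functions would disrupt all of them if it failed.
    return set.intersection(*dependencies.values())


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    inventory = {
        "payroll": {"endpoint-agent", "erp"},
        "dispatch": {"endpoint-agent", "cad"},
        "permitting": {"endpoint-agent", "web-portal"},
    }
    print(single_points_of_failure(inventory))  # {'endpoint-agent'}
```

A fuller analysis would also account for indirect dependencies and available fallbacks, which this sketch ignores.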

Sayers noted that some types of resiliency measures may be too expensive for most organizations to adopt. Some entities are already priced out of simply backing up all their data, and many could not afford to maintain alternate backup IT infrastructure to which they could fail over.

“But at least having that awareness, so that you don’t wake up on a Friday morning and realize, ‘Holy cow, we can’t even operate as an organization, because there’s a problem with one solution that we have in our environment’ — I think the awareness piece will go a long way in helping adjust to, and react to, events like this in the future,” he said.

Jule Pattison-Gordon

Jule Pattison-Gordon is a senior staff writer for Governing and former senior staff writer for Government Technology, where Jule specialized in cybersecurity. Jule also previously wrote for PYMNTS and The Bay State Banner and holds a B.A. in creative writing from Carnegie Mellon.
