Local CIO: CrowdStrike Outage Was ‘Unplanned’ Tabletop Exercise

Sunny Isles Beach, Fla., CIO Derrick Arias offers his account of triaging the July CrowdStrike/Microsoft event and what his team will take from the experience to apply when — not if — they experience another outage.

The CrowdStrike update that affected Microsoft systems and took out computers in governments and businesses alike on July 19 was an incident that IT teams will be learning from for some time to come. While the long-term effects of the event still play out, Derrick Arias, CIO of Sunny Isles Beach, Fla., a small town in Miami-Dade County, offered his first-person account of what that day looked like on the ground and what his team will take away from what he called an "unplanned tabletop exercise."

It’s 2:45 a.m., and I’m awakened by the vibration of my cellphone on the nightstand. I groggily come to, but not before the phone stops. “Thank God,” I think, “not that important.” But then it starts again. The call is from one of my senior analysts. I answer, and he begins to explain the situation.

But let's start at the beginning.

It is 1 a.m., early Friday morning, July 19, 2024, and things are very quiet in our Police Communications Center in Sunny Isles Beach, Fla. Only one communications officer is on duty, staring at the CCTV HyperWall and her two computer screens. One of them displays the computer-aided dispatch (CAD) system, showing the status of each on-duty officer. Suddenly, the window turns white. “That’s strange,” she mumbles, and proceeds to close the window and restart the application. After a delay, the error message comes: failure to connect. She looks at the IT on-call list and begins dialing.

After trying the third number, someone finally answers at 1:10 a.m. It’s the senior analyst who will eventually call me 90 minutes later. The officer explains that CAD is down, and the analyst begins to work, trying to figure out why the software will not function.

Immediately, he starts to connect to the city’s VPN and attempts to check the police servers. One after another, he finds that he cannot get through — CAD, RMS, SQL, domain controllers, none are responsive. Even basic network connectivity checks are failing. He calls the Communications Center back to speak with the dispatcher, and something she says captures his attention: She can see blue screens on many of the computers through the camera feeds from the government center. He suddenly has a flashback to 2012 at a former employer, when a bad McAfee update wreaked havoc on the network, crashing all of the systems.

By this time, a second analyst is involved. They ask the dispatcher if there is any other information on the blue screen and obtain their second clue: The issue involves the file csagent.sys. Immediately they make the connection to CrowdStrike. A quick Internet search for “CrowdStrike issue blue screen” returns multiple reports concerning the outage, providing a third data point. Between the analyst's previous experience in 2012 with McAfee, the csagent.sys reference on the blue screen and the Internet reports, he realizes this is something serious that needs to be escalated. He makes the 2:45 a.m. phone call to me, after which he calls our assistant CIO.

I immediately begin a group chat with our city manager’s office and police command staff, informing them of the issue, and begin driving to work. I wonder, “Is this what they think it is, a software glitch? Or could this be some sort of attack?” Concern lingers as I make my way through the empty streets.

By the time I reach our Government Center at 3:30 a.m., I have a text from my colleague. It's a screenshot he found on Twitter of a tech alert issued by CrowdStrike, with a four-step fix to restore systems. As I enter the building and head to my office, I notice every computer screen I pass is showing the Blue Screen of Death, except one: the communications officer's. I ignore that for the time being.

I reach my office, and my PC is also down. But my personal laptop is fine, so I have a device to work with if needed. I immediately get to work confirming the fix from CrowdStrike (sketched below) on one of our systems, and it restarts normally. Unfortunately, this means that we are going to need to touch every single machine individually because the fix cannot be automated. At this point, my colleague arrives, and after a quick update, I decide that he should begin working on getting our police servers back up while I go to work on fixing the department's computers. First and foremost, we need to get our officers who are out on the street up and running again. But I need to verify that everything works properly before I call the whole IT staff in, so I ask the dispatcher to have the on-duty sergeant bring their laptop in first. As I enter the Communications Center, another nagging question surfaces: Why is her machine not blue-screened? “No time to troubleshoot now,” I say to myself. I must keep things moving in order to have everyone back up quickly.
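For context, the four-step fix we are following is the manual remediation CrowdStrike published that morning. Roughly, it looks like the sketch below, run from a command prompt in Safe Mode or the Windows Recovery Environment; the drive letter and exact channel-file name are as widely reported at the time and may vary by machine and by the version of the guidance.

    rem Sketch of the widely reported manual remediation, run from a command
    rem prompt in Safe Mode or the Windows Recovery Environment. In WinRE the
    rem Windows volume is not always mounted as C:.
    cd /d C:\Windows\System32\drivers\CrowdStrike

    rem Delete the faulty channel file pushed by the bad update.
    del C-00000291*.sys

    rem Close the prompt and reboot normally; the sensor then picks up the
    rem corrected channel file.
    exit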

I send several messages on our IT department group chat: “All hands on deck! Everyone get in here asap.” It’s 4:06 a.m., and currently there are still just two of us onsite.

We begin running like machines, following the process step by step to fix each computer. He is working on the servers; I am working on the endpoints throughout the building. When he lets me know that he finally has all the police servers online, I go check with the dispatcher, and she confirms it: CAD is back up! It is now 5:30 a.m. I update the city manager’s office and police command staff, but I think nobody is awake yet because there has been no response on that chat. By now, the on-duty sergeant has arrived, so I fix his laptop and we confirm that everything works. Within minutes, all the on-duty officers are lining up in a conference room as I work on their laptops individually.

During this process, I run into another “weird” issue: One officer’s laptop is requesting a recovery key to boot. I skip this one until I am done with the others, then go to consult my colleague on this particular issue. We decide to set it aside and swap her laptop with one of the spare police laptops to get her back on the road.
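For those unfamiliar with that wrinkle: a recovery-key prompt at boot means the drive is protected by BitLocker, which became one of the widely reported complications of this outage. Had we needed to fix that laptop on the spot rather than swap it, the rough approach would have been to unlock the volume from the recovery command prompt and then apply the same deletion step. The sketch below assumes the 48-digit recovery password can be retrieved from wherever the keys are escrowed (Active Directory, for example); the key value shown is only a placeholder.

    rem Sketch: unlock a BitLocker-protected volume from the WinRE command
    rem prompt so the channel file can be deleted. The recovery password is a
    rem placeholder; the real one is looked up (for example, in Active
    rem Directory) using the key ID shown on the recovery screen.
    manage-bde -unlock C: -RecoveryPassword 111111-222222-333333-444444-555555-666666-777777-888888

    rem After unlocking, the same deletion step applies.
    del C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys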

As I continue to restore endpoints in police, the new shift of officers begins to arrive, and I tell them to assemble in the roll-call room with their devices. However, I realize that none of them has an issue because their devices were all turned off until now. Since they were off, they never received the bad CrowdStrike update, which has since been “fixed” by CrowdStrike, so as these devices boot up they receive a newer update and work fine.

Now, just after 6 a.m., I wonder, “Where is everyone?” I send out more urgent messages to the IT team, pressing them to come in as soon as possible. Meanwhile, with the Police Department fully operational again, we shift our focus to the civilian side of our operations. As we push forward with the restoration, yet another anomaly surfaces. The process I have been following is to bring up a command prompt through the recovery menu options. After choosing “Command Prompt,” the device prompts me to choose an “Administrator” account to use for the session, providing a list to choose from; on all of our computers, this list has simply included “Administrator.” However, this machine does not list any accounts to choose from, so I am unable to continue. I leave it as-is and move on to the next.

By 7 a.m., my staff and others begin to arrive. We continue bringing up endpoints until finally, by 8:45 a.m., all of the city’s information systems have been restored. Unfortunately, one of our software-as-a-service providers is still impacted by the issue and is not able to bring its system back online until approximately noon. This affects several city departments, preventing them from providing full services Friday morning.

As much as I did not enjoy the 3 a.m. wake-up call and having to drive to work in the middle of the night, this turned out to be a fantastic exercise. Had this been a cyber attack, matters would have been exponentially worse. In that case, the need to have zero trust in any of our systems’ integrity would have dragged this recovery process out for weeks, if not months. In this case, we were able to have essentially everything restored before staff arrived at work Friday morning (apart from the police, of course).

This experience taught us a few things. Inconsistencies in our system configurations led to multiple anomalies. In the heat of recovery, we did well to bypass those anomalies and bring up as many systems as quickly as possible. However, it highlighted the need for consistency, which admittedly is difficult to accomplish and maintain over prolonged periods of time. We found that the systems that did not show the Administrator account as a login option had the local admin account disabled. The systems that did not blue screen either did not have the CrowdStrike service running or had an older, unaffected version of the agent. In all, these anomalies amounted to only about seven endpoints out of more than 300.
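That local admin finding is easy to verify after the fact. The sketch below shows one way to check, and where policy allows, re-enable, the built-in account on a running system using standard Windows commands from an elevated prompt; whether to keep that account enabled at all is a separate policy decision.

    rem Sketch: check whether the built-in local Administrator account is
    rem enabled (the reason some of our machines offered no account in the
    rem recovery console). Run from an elevated command prompt.
    net user Administrator

    rem If "Account active" reads "No," the account can be re-enabled, though
    rem doing so (and the password policy around it) is a local decision.
    net user Administrator /active:yes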

Also, we obviously could have moved more quickly had everyone in IT come in right when we realized the need to manually fix every device. Admittedly, posting on a group chat at 4 a.m. was not an effective way of mobilizing staff. We will be implementing a phone tree system to make sure we operate effectively the next time the need arises. And no, there is no “if” — it will happen again.

But by far the biggest takeaway was that teamwork saves the day. It doesn’t matter how big or small your budget is, or which particular products you have or don’t have: something bad is going to happen at some point. Being able to operate well as a team to get through the event is what ultimately allowed us to succeed; our constant check-ins with each other kept us focused in the right direction, and when needed, we made quick decisions about our course of action. Our communication to the city’s management team flowed well, and overall, our systems are very well managed.

Derrick Arias has been chief information officer of Sunny Isles Beach, Fla., since 2012.