On July 19, 2024, the world experienced a disruption that reminded us of the importance of secure, robust release processes and security checks in software systems. CrowdStrike, a leading cybersecurity company, released an update that inadvertently caused widespread system failures. Let’s dive into what happened, why it was so impactful, and what we can learn from it.
The Incident at a Glance
CrowdStrike’s routine sensor configuration update, released at 04:09 UTC, triggered a logic error that resulted in system crashes and Blue Screen of Death (BSOD) errors on approximately 8.5 million Windows devices. The impact was far-reaching, affecting airlines, hospitals, airports, banks, emergency services, hotels, stock markets, broadcasting, and more.
The Blue Screen of Death (BSOD) is an error screen displayed by Windows when the operating system encounters a critical failure it cannot recover from. It typically results from hardware issues, faulty drivers, or software conflicts that cause the system to crash. The screen shows an error message and often a QR code for troubleshooting. Users can often resolve BSODs by updating drivers, running diagnostics, or reinstalling Windows if necessary. It is similar to a kernel panic in Unix-like systems.
A photo taken at San Jose International Airport today shows the dreaded Microsoft “Blue Screen of Death” across the board. Credits: x.com/adamdubya1990
CrowdStrike’s Falcon platform operates in kernel mode, which gives it privileged access to system memory and critical functions. This level of access is necessary for the platform’s advanced threat detection capabilities. However, it also means that any errors in the kernel driver can have system-wide consequences.
The Technical Details
The update was released at 04:09 UTC and targeted newly observed malicious named pipes used in cyberattacks. However, a logic error in the update caused an operating system crash. The culprit was identified as Channel File 291, located in C:\Windows\System32\drivers\CrowdStrike\ with a filename starting with C-00000291- and ending with .sys.
Systems running Linux or macOS do not use Channel File 291, so they were not impacted.
Why Couldn’t Windows Handle It?
You might wonder why Windows can’t be more resilient to such issues. Windows does offer recovery options like booting with the last known good configuration. However, CrowdStrike marked their driver as a boot driver – essential for starting the OS. This means that if the driver fails, the system can’t boot normally.
Falcon is a security product that needs to operate in kernel mode to monitor and analyze application behavior. By running in kernel mode, it can proactively detect new attacks before they’re categorized and listed in a formal definition. However, this also means that any bugs in Falcon’s kernel driver can cause system-wide issues.
CrowdStrike’s approach involves dynamic definition files that their kernel driver processes. These files are supposed to be safe and effective updates, but having a kernel-mode driver interpret externally delivered content is inherently risky. In this case, a faulty dynamic data file caused the driver to crash the system. Early reports described the file as essentially filled with zeros, although CrowdStrike has stated that the crash was not related to null bytes in the channel file.
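One way to reduce this kind of risk is to check a content file structurally before a privileged component interprets it. The sketch below is purely illustrative: the header layout, the `MAGIC` value, `ENTRY_SIZE`, and the `validate_content_file` function are all hypothetical and do not describe CrowdStrike's actual channel file format or driver.

```python
import struct

# Hypothetical layout (an assumption for illustration only):
#   4-byte magic, 4-byte little-endian entry count, then fixed-size entries.
MAGIC = b"CHNL"
HEADER = struct.Struct("<4sI")
ENTRY_SIZE = 16


def validate_content_file(data: bytes) -> bool:
    """Return True only if the blob is structurally sound enough to parse."""
    if len(data) < HEADER.size:
        return False  # too short to contain a header
    magic, count = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        return False  # wrong or zeroed-out magic
    # The declared entry count must match the bytes actually present,
    # so a parser never reads past the end of the buffer.
    return len(data) == HEADER.size + count * ENTRY_SIZE


good = HEADER.pack(MAGIC, 2) + bytes(2 * ENTRY_SIZE)
print(validate_content_file(good))       # True
print(validate_content_file(bytes(64)))  # False: all zeros
print(validate_content_file(good[:-1]))  # False: truncated
```

The point of the design is that the check runs before any parsing logic, so a malformed file is rejected rather than partially interpreted.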
The Fix
Fortunately, resolving the issue on affected machines is relatively straightforward:
- Boot into Safe Mode
- Navigate to C:\Windows\System32\drivers\CrowdStrike\ and delete the file matching the pattern C-00000291*.sys
After these steps, systems should boot normally without the problematic update.
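For readers who want to script the file lookup, here is a minimal sketch. It assumes Python is available on the machine (which may not be true in Safe Mode) and uses the directory and filename prefix given in the technical details earlier in this post. The function names are our own, and the script defaults to a dry run that only lists what it would delete.

```python
from pathlib import Path

# Directory and pattern as described in this post; adjust if your install differs.
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
PATTERN = "C-00000291*.sys"


def find_channel_files(directory: Path, pattern: str = PATTERN) -> list[Path]:
    """List files in `directory` whose names match `pattern`."""
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.glob(pattern) if p.is_file())


def remove_channel_files(directory: Path, dry_run: bool = True) -> list[Path]:
    """Delete matching files; with dry_run=True only report them."""
    matches = find_channel_files(directory)
    for path in matches:
        if not dry_run:
            path.unlink()
    return matches


if __name__ == "__main__":
    for path in remove_channel_files(DRIVER_DIR, dry_run=True):
        print(f"would delete: {path}")
```

Review the dry-run output first, and only then call `remove_channel_files(DRIVER_DIR, dry_run=False)`.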
For more information, visit the CrowdStrike Support Article.
Lessons Learned
This incident highlights several crucial points:
- Thorough Testing: Updates need to be rigorously tested in environments that closely mimic production systems.
- Gradual Rollouts: Implementing updates incrementally can help catch issues before they affect a large user base.
- Robust Release Processes: Stringent checks and balances in the software development lifecycle are critical.
By building these practices into your software development life cycle, your organization can establish more resilient systems, making it extremely difficult, if not impossible, to release updates that have not passed all essential security verifications.
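The gradual rollout idea can be sketched in a few lines. This is a simplified illustration, not a description of any vendor's pipeline: `staged_rollout`, the `deploy` and `is_healthy` callbacks, and the stage percentages are all assumptions chosen for the example.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_percents: Sequence[int] = (1, 10, 50, 100),
) -> list[str]:
    """Deploy to growing shares of hosts, halting when any updated host is unhealthy.

    Returns the hosts that received the update.
    """
    updated: list[str] = []
    for percent in stage_percents:
        # Always move forward by at least one host per stage.
        target = max(len(updated) + 1, len(hosts) * percent // 100)
        target = min(target, len(hosts))
        for host in hosts[len(updated):target]:
            deploy(host)
            updated.append(host)
        if not all(is_healthy(h) for h in updated):
            break  # stop here; remaining hosts keep the old version
        if len(updated) == len(hosts):
            break
    return updated


hosts = [f"host-{i}" for i in range(200)]
# Pretend the update breaks host-5; every other host stays healthy.
updated = staged_rollout(hosts, deploy=lambda h: None,
                         is_healthy=lambda h: h != "host-5")
print(len(updated))  # 20
```

In this example the problem surfaces in the 10% stage, so 180 of the 200 hosts never receive the faulty update.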
The Role of AI in Prevention
As we move forward, AI-powered tools are becoming increasingly valuable in preventing such incidents. These tools can automate tasks, identify potential threats, and respond to incidents more effectively.
At ARMUR, we strongly advocate for the adoption of LLM-powered tools that can help prevent such incidents. We provide a wide range of AI code vulnerability scanning tools, which can save you time and spare you the chaos.
Conclusion
This event is a significant reminder of the consequences of neglecting extensive code vulnerability tests. ARMUR urges all organizations to take this moment to review and improve their release processes. Remember, with great power comes great responsibility. It’s time to leverage AI-powered technologies to automate tasks, identify threats, and respond to incidents more effectively. To stay updated with the latest security discussions, you can join our Discord server, where we share valuable insights and learning resources on topics such as SecOps, DevSec, and Red Teaming. Additionally, you can try out ARMUR and get 100 free credits.