Qualitest's Take on the CrowdStrike Global Systems Outage: Analyzing the Causes and Future Prevention

CrowdStrike, the cybersecurity company whose faulty update crashed Windows computers across the globe, has identified a quality-control defect that led to the outages for millions of Microsoft Windows users.

On 24 July 2024, CrowdStrike published an incident report that identified a bug in a quality-control tool used to check content updates for errors. Millions of devices were affected, and the outage spread across sectors including travel, healthcare, and many more. According to the report, the faulty content triggered an out-of-bounds memory read, raising an exception that could not be handled gracefully and crashed Windows with the Blue Screen of Death.

The outage has been discussed all over the world since it happened and has shone a light on disaster recovery capabilities and business continuity plans. It also raises many questions about how much testing organizations conduct, and how effective that testing is, before updates are deployed onto live machines.

Global system shutdown: what exactly happened?

Early in the morning on 19 July 2024, it became clear that a major IT issue had hit infrastructure and services globally, with healthcare, financial services, banking and aviation all affected by the glitch.

The IT glitch originated at cybersecurity company CrowdStrike, which confirmed the issue and, in the words of its CEO, attributed it to a “defect in a single content update for Windows hosts.” Simply put, it was a flaw in a software update that was pushed out to CrowdStrike’s customers running Windows.

The resulting chaos was widespread and profound: flights were grounded, rail networks ground to a halt, healthcare services including many doctors’ surgeries and prescription collections were disrupted, and online payment systems were shut down. Even the London Stock Exchange was affected. Some broadcasting organizations were hit too; Sky News in the UK was forced off the air because of the IT glitch.

In the US, thousands of flights were cancelled over the weekend. Delta Air Lines was hit particularly hard by the large-scale outage, with more than 5,000 flights affected.

The flaw in the update caused many Windows PCs to crash and display the well-known “blue screen of death”. Around the world, the IT infrastructure at many institutions and organizations failed, taking their online systems offline.

CrowdStrike’s response to what happened

The CEO of CrowdStrike, George Kurtz, said at the time of the glitch that the company was “actively working” with those impacted. He confirmed that the outage was not a “security incident or cyber-attack”, that the issue had been “identified” and “isolated” and that a “fix has been deployed”. He also urged customers to keep checking CrowdStrike’s support portal for assistance and updates, and added that his team was “fully mobilized to ensure the security and stability of CrowdStrike customers”.

As a result of the glitch, CrowdStrike’s share price plunged. According to a report released by the security vendor, it now plans to do more testing of the type of update that caused the crashes before sending such updates out. In addition, updates will be rolled out gradually, starting with a small group of users before reaching larger ones. This approach, known as a “canary deployment”, allows problems to be detected before updates are launched more widely.

Staging the update rollout process is also vital. Updates that have the power to crash systems should first be deployed to a quarantined environment, where they can be tested for issues before they are rolled out to wider company systems.
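
To make the idea of a staged, canary-style rollout concrete, here is a minimal Python sketch. It is an illustration only and does not describe CrowdStrike's actual process: the `Ring` type, the ring names, and the `deploy`, `is_healthy` and `rollback` callbacks are all hypothetical placeholders for whatever tooling an organization already uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Ring:
    """A rollout stage: a name and the hosts that belong to it (hypothetical)."""
    name: str
    hosts: List[str]


def staged_rollout(
    rings: List[Ring],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    max_failure_rate: float = 0.0,
) -> bool:
    """Deploy ring by ring; stop and roll back if a ring looks unhealthy.

    Returns True if every ring was updated, False if the rollout was halted.
    """
    updated: List[str] = []
    for ring in rings:
        for host in ring.hosts:
            deploy(host)
            updated.append(host)

        # Check the ring that was just updated before touching the next one.
        failures = sum(1 for host in ring.hosts if not is_healthy(host))
        failure_rate = failures / len(ring.hosts) if ring.hosts else 0.0

        if failure_rate > max_failure_rate:
            # Halt before the next (larger) ring and undo what was deployed.
            for host in reversed(updated):
                rollback(host)
            return False
    return True
```

Ordering the rings from a quarantined lab, to a small canary group, to the broad population means a defective update is stopped while only a handful of machines are exposed. Note that the sketch assumes an unhealthy host can still be reached for rollback, which may not hold if the update prevents the machine from booting.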

Lessons learned from what happened

As the world’s leading Quality Engineering specialist, we understand the critical importance of the CIA triad (confidentiality, integrity and availability).

This incident had a catastrophic impact on the availability component of the CIA triad, which not only caused mammoth financial losses but also did major damage to the reputation of many businesses worldwide. It serves as a wake-up call for organizations to prioritize Quality Engineering best practices. What could have been done, and what can be done in the future, to avoid these kinds of incidents?

The incident offers valuable lessons, outlined below, to help organizations avoid similar issues in the future.

Effective operational acceptance testing
Effective operational acceptance testing in a staging (production-like) environment is critical. It should cover not only functional aspects but also compatibility, performance, and security before deploying to production.

Effective business continuity and disaster recovery planning
It is crucial for organizations to have a proper business continuity plan and disaster recovery setup, so that such events have minimal impact on the business and the organization can recover from them quickly.

Continuous delivery/deployment pipelines
Organizations should have robust continuous delivery/deployment pipelines with properly gated stages that catch bad code before any larger deployment is rolled out, backed by a rollback mechanism that helps customers recover if a defect does get through.

Risk impact assessment
Organizations should implement a comprehensive risk impact assessment framework to evaluate the security risks of any deployment. This will help define the appropriate testing strategy and stages to reduce risk.
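
As a rough illustration of how a risk impact assessment can drive the testing strategy, the Python sketch below scores a change and maps the score to required stages. The `Change` fields, the 1 to 5 scales, the rollback penalty and the thresholds are all assumptions chosen for the example, not an established standard; a real framework would calibrate them to the organization's own risk appetite.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Change:
    """Hypothetical description of a deployment under assessment."""
    name: str
    likelihood: int      # 1 (failure unlikely) .. 5 (failure likely)
    impact: int          # 1 (minor) .. 5 (can crash or take hosts offline)
    easy_rollback: bool  # True if the change can be reverted remotely


def risk_score(change: Change) -> int:
    """Likelihood x impact score, raised when rollback is hard."""
    score = change.likelihood * change.impact
    if not change.easy_rollback:
        score += 5  # illustrative penalty, not a standard value
    return score


def required_stages(change: Change) -> List[str]:
    """Map the risk score to the testing and release stages to run."""
    score = risk_score(change)
    stages = ["automated unit and integration tests"]
    if score >= 6:
        stages.append("operational acceptance testing in staging")
    if score >= 12:
        stages.append("canary deployment to a small group")
    if score >= 20:
        stages.append("manual sign-off before broad release")
    return stages
```

For example, under these assumed thresholds a change described as `Change("kernel-level content update", likelihood=3, impact=5, easy_rollback=False)` scores 20 and would pass through all four stages, while a low-impact, easily reverted change would need only the automated tests.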

Catching bugs early to avoid outages

The CrowdStrike global IT issue underscores the vital role of testing in development. Thorough, systematic testing throughout the software development lifecycle identifies potential issues early, ensuring products and systems are reliable, secure, and user-friendly.

Catching issues early not only saves time and resources but also protects the reputation of organizations like CrowdStrike. Investing in robust quality engineering is essential for consistently delivering high-quality products that meet and exceed user expectations, while also preventing critical incidents. Organizations must prioritize quality engineering to maintain trust and reliability in their offerings.
