Silent chip defects may be corrupting data in modern computers

Computing is often celebrated for its precision and speed. But researchers and hyperscale data center operators are warning of a growing threat that challenges one of computing’s core promises: correctness. The issue is known as silent data corruption (SDC) – a phenomenon where hardware defects cause programs to produce incorrect results without crashing, triggering an error, or leaving any visible trace.

The invisible threat inside modern chips

At the heart of the concern are silicon defects in CPUs, GPUs and AI accelerators. These defects can originate during chip design or manufacturing, or develop later through aging or environmental factors. Manufacturers screen for most faults, but even the most rigorous production testing catches only an estimated 95% to 99% of modeled defects. Some flawed chips inevitably make it into the field.

In certain cases, those defects lead to visible failures such as system crashes. But more troubling are silent errors. Here, a faulty logic gate or arithmetic unit may produce a wrong value during execution. If that value propagates through the program without triggering detection mechanisms, the system completes the task and returns an incorrect output – with no indication anything went wrong.
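To make the mechanism concrete, here is a minimal sketch in Python of how a silent error can pass through a program, and how running the same work twice on different hardware can expose it. Everything in it is hypothetical: the `faulty_add` function only simulates a defective arithmetic unit (it flips one result bit when both operands are odd), and the `run_redundantly` helper stands in for redundant execution on two separate cores. It is an illustration under those assumptions, not a description of how any real chip fails.

```python
from typing import Callable, Iterable, Tuple

Adder = Callable[[int, int], int]


def healthy_add(a: int, b: int) -> int:
    """Reference adder standing in for a defect-free arithmetic unit."""
    return a + b


def faulty_add(a: int, b: int) -> int:
    """Hypothetical defective adder: flips bit 3 of the sum when both
    operands are odd. No exception is raised, so the error is silent."""
    result = a + b
    if (a & 1) and (b & 1):
        result ^= 1 << 3
    return result


def running_total(values: Iterable[int], add: Adder) -> int:
    """Sum the values using the supplied adder; a wrong intermediate
    value simply propagates into the final result."""
    total = 0
    for value in values:
        total = add(total, value)
    return total


def run_redundantly(values: Iterable[int], unit_a: Adder, unit_b: Adder) -> Tuple[int, bool]:
    """Run the same computation on two units and report whether they agree."""
    values = list(values)
    result_a = running_total(values, unit_a)
    result_b = running_total(values, unit_b)
    return result_a, result_a == result_b


data = [2, 4, 7, 10, 5]
silent_result = running_total(data, faulty_add)  # completes normally, value is wrong
_, units_agree = run_redundantly(data, faulty_add, healthy_add)
print("result from faulty unit:", silent_result)
print("redundant runs agree:", units_agree)
```

The faulty run finishes without any error, which is exactly what makes the corruption silent. Only the comparison against a second, independent execution reveals a problem, and that comparison doubles the work.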

For decades, many believed SDCs were rare, almost mythical events. However, major hyperscale operators including Meta, Google and Alibaba have disclosed that roughly one in 1,000 CPUs in their fleets can produce silent corruptions under certain conditions. Similar concerns have been reported in GPUs and AI accelerators.

Correctness is a foundational property of computing. Whether processing financial transactions, running AI inference, or managing infrastructure, systems are expected to deliver accurate results within strict time constraints.

Silent corruption undermines that trust. Unlike crashes, which are immediately visible and prompt investigation, SDCs quietly alter outputs. In data centers operating millions of cores, even a small defect rate can translate into hundreds of incorrect program results per day.

The scale of modern computing intensifies the problem

Massively parallel architectures such as GPUs and AI accelerators contain thousands of arithmetic units. The more components a system includes, the higher the statistical likelihood that at least one of them is defective.
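The scaling effect can be sketched with a simple probability model. Assuming each unit is defective independently with the same probability p (a simplification, since real defects may be correlated), the chance that a system of n units contains at least one defective unit is 1 - (1 - p)^n. The per-unit probability used below is purely illustrative and is not a measured figure.

```python
def prob_any_defective(per_unit_defect_prob: float, unit_count: int) -> float:
    """Probability that at least one of `unit_count` units is defective,
    assuming independent defects with identical per-unit probability."""
    if not 0.0 <= per_unit_defect_prob <= 1.0:
        raise ValueError("per_unit_defect_prob must be between 0 and 1")
    if unit_count < 0:
        raise ValueError("unit_count must be non-negative")
    return 1.0 - (1.0 - per_unit_defect_prob) ** unit_count


# Illustrative value only: one defect per million units.
ASSUMED_PER_UNIT_DEFECT_PROB = 1e-6

for unit_count in (1_000, 10_000, 100_000, 1_000_000):
    chance = prob_any_defective(ASSUMED_PER_UNIT_DEFECT_PROB, unit_count)
    print(f"{unit_count:>9} units: {chance:.4%} chance of at least one defect")
```

Under this model the risk grows almost linearly with unit count while p is small, then saturates toward certainty as the system gets large enough.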

Measuring SDCs directly is nearly impossible – by definition, they are silent. The industry must therefore estimate their rates and weigh the cost of prevention. Detection and correction mechanisms exist, but they can significantly increase silicon area, energy consumption and performance overhead.

Researchers are calling for multi-layer solutions, including improved manufacturing tests, fleet-level monitoring in data centers, smarter fault estimation models, and hardware-software co-design approaches that contain errors before they propagate.
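One way to picture fleet-level monitoring is a known-answer screen: every core runs the same deterministic workload, and any core whose answer differs from the majority is flagged for further testing. The sketch below is a toy model of that idea, not an account of any operator's actual tooling. The core names, the `known_workload` function and the simulated `faulty_multiply` defect are all hypothetical, and majority voting assumes that most cores in the sample are healthy.

```python
from collections import Counter
from typing import Callable, Dict, List

Multiplier = Callable[[int, int], int]


def healthy_multiply(a: int, b: int) -> int:
    """Reference multiplier standing in for a defect-free core."""
    return a * b


def faulty_multiply(a: int, b: int) -> int:
    """Hypothetical defective multiplier: off by one for a single operand value."""
    product = a * b
    return product + 1 if b == 7 else product


def known_workload(multiply: Multiplier) -> int:
    """Deterministic test workload: a factorial built from repeated multiplies."""
    acc = 1
    for factor in range(1, 21):
        acc = multiply(acc, factor)
    return acc


def screen_fleet(cores: Dict[str, Multiplier]) -> List[str]:
    """Return the names of cores whose result differs from the majority answer."""
    results = {name: known_workload(multiply) for name, multiply in cores.items()}
    majority_value, _ = Counter(results.values()).most_common(1)[0]
    return sorted(name for name, value in results.items() if value != majority_value)


fleet = {
    "core-0": healthy_multiply,
    "core-1": healthy_multiply,
    "core-2": healthy_multiply,
    "core-3": faulty_multiply,  # simulated defective core
}
print("suspect cores:", screen_fleet(fleet))
```

A screen like this only catches defects that the chosen workload happens to trigger, which is one reason the defect in this toy model is tied to a specific operand value.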

As computing systems grow larger and faster, the challenge is clear: maintain both speed and correctness without unsustainable cost. In what some describe as a “Golden Age of Complexity,” ensuring that computing remains trustworthy may become one of the industry’s defining engineering battles.
