CPU, GPU, and NPU: Understanding Key Differences and Their Roles in Artificial Intelligence

Jun 18, 2024

English translation of an Italian post originally published on Levysoft.it

First introduced in the 1960s, CPUs (Central Processing Units) have been the beating heart of every computer, responsible for executing all primary operations. Designed to be versatile and capable of handling a wide range of instructions and operations, they are ideal for running operating systems, productivity software, and many other general applications. However, with the advent of the first 3D video games and advanced graphic applications, the limitations of CPUs became apparent. Their architecture was not optimized for the massive parallel computing required by intensive graphic applications and scientific simulations, as they were specifically designed for general-purpose computing.

These limitations of CPUs and math coprocessors led, in the 1990s, to the development of GPUs (Graphics Processing Units), which quickly became essential for the parallel processing of large volumes of data. GPUs (available as integrated graphics chips or standalone graphics cards) are built from hundreds or thousands of smaller, specialized cores containing ALUs (Arithmetic Logic Units) that can perform many operations simultaneously, making them ideal for graphics rendering and, more recently, for training and deploying deep learning models.

In recent years, we have witnessed the emergence of a new category of processing units: NPUs (Neural Processing Units). While math coprocessors and GPUs have accelerated floating-point calculations and parallel processing of large data volumes, NPUs are designed to efficiently handle matrix multiplication and addition, the operations at the core of artificial intelligence (AI) and machine learning (ML) workloads such as image recognition and natural language processing.

In practice, CPUs, GPUs, and NPUs are all crucial for the functioning of a computer, but each is optimized to handle different types of calculations and rendering. The GPU is specialized in rendering complex images for applications such as video editing and gaming, while the NPU handles repetitive and less complex AI tasks, such as background blurring in video calls or object detection in photo and video editing, offloading these minor tasks from the GPU. This allows the CPU and GPU to focus on more intensive and complex activities, which improves overall system efficiency and keeps any single processor from becoming overloaded.

Matrices in AI

In artificial intelligence, especially in deep learning, matrix multiplication and addition (combined in the operation known as GEMM, “General Matrix Multiply”) are crucial for working with data.

What are matrices? Imagine a matrix as a table of numbers. For example, a black and white image can be represented as a matrix where each number represents a pixel. Similarly, the words in a sentence can be converted into numbers and placed in a matrix.

How do they work in neural networks? Neural networks use these matrices of numbers to make predictions or recognize patterns. Each “layer” of the network takes a matrix of numbers and transforms it into another matrix. This happens through matrix multiplication, which helps the network understand complex relationships between data.
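As a rough illustration of the transformation described above, the following pure-Python sketch passes a tiny input matrix through one "layer" by multiplying it by a weight matrix and adding a bias. The function names (`matmul`, `add_bias`, `layer`) and all the numbers are invented for this example; real frameworks rely on heavily optimized libraries rather than nested loops, and usually apply a non-linear function after this step.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of rows."""
    k = len(b)
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(len(b[0]))]
            for row in a]


def add_bias(matrix, bias):
    """Add the same bias vector to every row of the matrix."""
    return [[value + b for value, b in zip(row, bias)] for row in matrix]


def layer(inputs, weights, bias):
    """One simplified layer: matrix multiplication followed by addition."""
    return add_bias(matmul(inputs, weights), bias)


# Two samples with three features each (for instance, three pixel values).
inputs = [[1, 2, 3],
          [4, 5, 6]]
# Illustrative weights mapping three input features to two outputs.
weights = [[1, 0],
           [0, 1],
           [1, -1]]
bias = [0, 1]

outputs = layer(inputs, weights, bias)
print(outputs)  # [[4, 0], [10, 0]]
```

Each output number is a sum of products, which is exactly the kind of work that dedicated AI hardware is built to do in bulk.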

And in language models? In large language models (LLMs), like those used to understand and generate text, matrices help represent and transform tokens. Tokens are small parts of text, such as words or symbols. These are converted into numbers and organized into matrices. Operations on these matrices allow the model to understand how words or tokens relate to each other, helping it grasp the context and meaning of the text. For example, when an AI model processes a sentence, matrices help it work out how each word connects to the others, enabling it to generate responses or continue the text coherently. If you want to delve deeper into the concept of LLMs, I recommend this lengthy and unparalleled article by Ars Technica that explains, without technical jargon, how large language models work. Or you can take a look at this other interesting article by Stephen Wolfram on how ChatGPT works.
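To make the idea of "tokens organized into matrices" a little more concrete, here is a deliberately tiny sketch. The vocabulary, the embedding numbers, and the helper names (`tokenize`, `embed`, `pairwise_scores`) are all hypothetical; real LLMs use learned subword tokenizers, embeddings with thousands of dimensions, and additional learned transformations before comparing tokens.

```python
# Hypothetical toy vocabulary: each token is mapped to an integer ID.
vocabulary = {"the": 0, "cat": 1, "sat": 2}

# Hypothetical embedding table: one row of numbers per token ID.
embedding_table = [
    [1, 0],  # "the"
    [0, 2],  # "cat"
    [1, 2],  # "sat"
]


def tokenize(text):
    """Split on whitespace and look up each word's token ID."""
    return [vocabulary[word] for word in text.lower().split()]


def embed(token_ids):
    """Stack the embedding row of each token into a sentence matrix."""
    return [embedding_table[token_id] for token_id in token_ids]


def pairwise_scores(matrix):
    """Dot product of every row with every row (the matrix times its transpose)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in matrix]
            for row_a in matrix]


token_ids = tokenize("The cat sat")
sentence_matrix = embed(token_ids)
scores = pairwise_scores(sentence_matrix)

print(token_ids)  # [0, 1, 2]
print(scores)     # [[1, 0, 1], [0, 4, 4], [1, 4, 5]]
```

A larger score between two rows means the two token vectors point in more similar directions, which is a crude stand-in for how a model relates one token to another.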

In summary, matrix multiplication and addition are like a recipe that the AI model follows to transform raw data into useful results, such as recognizing objects in an image or understanding a sentence. And this is where, as previously mentioned, NPUs come into play because they are specifically designed to quickly perform calculations on large amounts of data represented in matrix form, making them extremely efficient for artificial intelligence applications.

NPU Architecture

The architecture of NPUs is significantly different from that of CPUs and GPUs. While CPUs are designed to execute a variety of instructions sequentially and GPUs to perform multiple parallel operations, NPUs are built to specifically accelerate machine learning operations. This is achieved through:

Specialized compute units: NPUs integrate dedicated hardware for multiplication and accumulation operations, essential for training and inference of neural network models.
High-speed on-chip memory: To minimize bottlenecks related to memory access, NPUs feature high-speed integrated memory, allowing rapid access to model data and weights.
Parallel architecture: NPUs are designed to perform thousands of parallel operations, making them extremely efficient in processing data batches.
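The multiply-accumulate operation mentioned in the list above can be sketched in a few lines of Python. This is only an arithmetic illustration with made-up names (`mac`, `dot_product`) and values: an NPU performs many of these steps in parallel in dedicated hardware, whereas the loop below runs them one after another.

```python
def mac(accumulator, a, b):
    """One multiply-accumulate step: multiply a by b and add to the accumulator."""
    return accumulator + a * b


def dot_product(weights, activations):
    """Compute a dot product as a chain of MAC steps, counting the steps."""
    accumulator = 0
    mac_steps = 0
    for w, x in zip(weights, activations):
        accumulator = mac(accumulator, w, x)
        mac_steps += 1
    return accumulator, mac_steps


result, steps = dot_product([1, 2, 3, 4], [5, 6, 7, 8])
print(result, steps)  # 70 4
```

Every entry of a matrix product is a dot product like this one, so a chip with thousands of MAC units can fill in many entries of the result at once.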

An NPU can process thousands of images per minute, analyzing them in detail. This enables the unit to “read” the surrounding reality, offering advanced experiences to users through the sensors and cameras of devices, performing complex tasks such as facial recognition, photo enhancement, and interaction with mixed reality applications.

A particular type of NPU is the TPU (Tensor Processing Unit), a processor developed by Google specifically to accelerate machine learning tasks in their data centers. In fact, while NPUs are a broader category of AI-dedicated processors designed by various manufacturers to enhance AI performance on a variety of devices and applications, TPUs are designed and optimized to handle workloads on TensorFlow and to efficiently perform tensor calculations in Google cloud services.

What Are TOPS in Artificial Intelligence

The term TOPS is not new in the tech world, but it has recently received a lot of mainstream attention with the rise of high-end AI PCs.

TOPS (Tera Operations Per Second) is a metric used to measure the computing capacity of an NPU or other AI-specialized processor. It indicates the number of operations the processor can perform in one second, expressed in trillions (on the short scale used in English, 1 trillion is a million million, i.e. 10^12 = 1,000,000,000,000). The key unit within an NPU for these operations is the MAC (Multiply-Accumulate) unit, which performs one multiplication and one addition (which, as previously noted, are fundamental for neural networks and other AI applications) in each clock cycle, so each MAC counts as two operations. The operating frequency of the NPU, i.e., the number of clock cycles per second, determines how many times per second each MAC unit can do this, and thus the TOPS.

Although TOPS is not a perfect metric and many variables influence a system’s performance in executing AI tasks, chip manufacturers rely on this parameter to advertise their products, primarily to simplify performance metrics and help buyers understand what they are getting. The following formula for calculating TOPS provides a quick reference to evaluate an NPU’s speed and compare it with other units:

TOPS = (2 × Number of MAC units × Operating frequency in Hz) / 10^12

For example, if an NPU has 1000 MAC units and operates at 1 GHz (1 billion cycles per second), the TOPS are calculated as follows:

TOPS = (2 × 1,000 × 1,000,000,000) / 10^12 = 2 TOPS
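The same calculation can be written as a small Python helper. This is a sketch of the formula as given above, not a vendor tool: the function names and the `ops_per_mac` parameter are my own, and the 1 GHz clock used in the second call is a hypothetical value chosen only to show how the formula can be inverted.

```python
def tops(mac_units, frequency_hz, ops_per_mac=2):
    """Theoretical peak TOPS: each MAC counts as a multiply plus an add."""
    return ops_per_mac * mac_units * frequency_hz / 1e12


def mac_units_needed(target_tops, frequency_hz, ops_per_mac=2):
    """Invert the formula: MAC units required for a target TOPS at a given clock."""
    return target_tops * 1e12 / (ops_per_mac * frequency_hz)


# The worked example from the text: 1,000 MAC units at 1 GHz.
example_tops = tops(1000, 1e9)
print(example_tops)  # 2.0

# Hypothetical: MAC units needed to reach 45 TOPS if the clock were 1 GHz.
units_for_45 = mac_units_needed(45, 1e9)
print(units_for_45)  # 22500.0
```

As with the formula itself, these are theoretical peaks: they say nothing about whether memory and software can keep the MAC units busy.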

This metric provides a clear view of an NPU’s computational power, allowing for performance comparison between different units and AI architectures. However, it is important to note that a high TOPS value does not always guarantee optimal performance in real-world applications. Factors such as memory bandwidth, software optimizations, and system integration play a crucial role in determining an NPU’s actual effectiveness in everyday AI operations.

Differences Between GPU and NPU in TOPS

Reading that desktop GPUs, such as the NVIDIA RTX 4090, boast over 1,300 TOPS of performance, while Microsoft’s Copilot+ laptops claim only 45 TOPS, could be confusing. But there is nothing inconsistent: the difference has a clear explanation.

The GPU figure represents the theoretical maximum of the parallel operations that the GPU can perform. GPUs are general-purpose parallel processors, designed to handle a wide range of complex parallel calculations, from 3D graphics to AI acceleration. Hence, their TOPS count reflects hardware sized to handle intense graphic operations and AI calculations simultaneously.

NPUs, on the other hand, are specialized to optimize only AI operations, such as matrix multiplication and other deep learning operations. Even with a lower TOPS count than GPUs, NPUs are more efficient in their specific domain. So the 40–45 TOPS NPU integrated into the Qualcomm Snapdragon or Intel processors of Copilot+ systems is highly optimized for specific AI calculations such as running neural networks, but does not have the same breadth of applications as a GPU.

There is also another difference to consider: GPUs can operate at different precisions (FP32, FP16, etc.) to maximize performance in graphics and AI calculations, while NPUs often operate at lower precisions (INT8, INT4) to optimize efficiency and reduce power consumption. A TOPS figure is therefore only meaningful together with the precision at which it was measured: values such as the 682 TOPS reported for the RTX 5000 Ada refer specifically to the AI performance of its Tensor Cores at precisions like FP8.
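To give a feel for what "operating at lower precision" means, the sketch below converts a few floating-point weights to 8-bit integers and back. It uses symmetric quantization with a single shared scale, which is one common scheme and is chosen here purely as an assumption; it is not a description of how any particular NPU or GPU works, and the function names and weight values are invented.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.4, -1.0, 0.25, 0.0]  # illustrative values only
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)  # [51, -127, 32, 0]
print(max_error)  # small rounding error, at most about scale / 2
```

Each weight now fits in one byte instead of four, at the cost of a small rounding error, which is the trade-off behind the efficiency and power savings of low-precision hardware.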

Ultimately, GPUs and NPUs serve different purposes in AI computing. GPUs, with a high number of TOPS, handle a wide range of tasks simultaneously, including graphic operations and AI at various precisions. NPUs, despite having fewer TOPS, are optimized to efficiently perform specific AI operations, often using lower precisions. Therefore, while TOPS is a useful metric, it does not provide a complete assessment and must be considered in the context of specific applications and system architecture.

GPU and AI: The Evolution of AI Accelerators

Before the advent of NPUs, GPUs and integrated solutions in processors like Apple’s M series dominated the field of local AI. GPUs, particularly those produced by NVIDIA, with their CUDA and Tensor Core architectures, were the de facto standard for training and inference of machine learning models due to their parallel architecture and the ability to perform huge volumes of parallel calculations in significantly reduced times. These GPUs powered not only PCs but also data center servers, enabling large-scale deep learning model training and making NVIDIA the undisputed leader in the data center GPU sector (holding 98% of the market, with 3.76 million GPUs shipped in 2023).

NPUs (also known as AI accelerators) were first introduced by Huawei and Apple with the Mate 10 Pro and iPhone X models. These specialized chips performed complex calculations for artificial intelligence and machine learning, significantly improving performance and energy efficiency. For Apple, the A11 Bionic AI chip, introduced with the iPhone X, assisted the SoC’s graphics accelerator in creating studio-quality animations and Animoji, and powered Face ID facial recognition. For Huawei, the Kirin 970 AI chip, with its “distributed computing” structure, specialized in image analysis, handling voice requests, and real-time translation without excessively impacting battery life.
Later, Qualcomm’s AI SoC, the Snapdragon 845, came to market, handling repetitive and intensive tasks, improving the main processor’s efficiency and speeding up the smartphone’s overall operation.

Apple M Processors

With the introduction of the M series processors (like M1, M2, M3, and the latest M4), Apple has taken local AI to a new level in consumer devices. These processors integrate an advanced GPU and a neural engine that accelerates AI operations directly on the device, without the need for cloud services. This is particularly useful for applications requiring low latency and high energy efficiency, such as voice and facial recognition, computational photography, and augmented reality applications.
Apple’s integrated solutions with their AI-optimized cores are an example of how hardware can be specifically designed to handle machine learning workloads, significantly improving performance and energy efficiency compared to previous solutions.

Apple M4 vs Snapdragon X Elite

The high-end chips with advanced NPU capabilities (or Neural Engine, as Apple calls it) are currently produced by Apple and Qualcomm. The Neural Engine of the Apple M4 has 16 cores and can perform up to 38 TOPS. In comparison, the NPU of the Apple M3 reaches 18 TOPS, the Apple M2 16 TOPS, the Apple M1 11 TOPS, and the NPU of the Apple A17 Pro in the iPhone 15 Pro offers 35 TOPS. On the other hand, the Hexagon NPU of the Snapdragon X Elite reaches up to 45 TOPS.

It is important to note that TOPS values are not fully meaningful without considering the type of operations and the precision. Both the 45 TOPS for Qualcomm and the 38 TOPS for Apple are based on INT8 operations (8-bit integers), as confirmed by expert Ben Bajarin. This positions Qualcomm ahead in terms of AI computing capacity, although Apple remains a leader in consumer chipsets compared to Intel and AMD.

Looking further, however, the Apple M4 offers superior CPU performance compared to the Snapdragon X Elite, with a 23% speed advantage over the top-tier X Elite variant. In graphics as well, the 6-core Adreno GPU of the Snapdragon X Elite cannot compete with the 10-core GPU of the Apple M4.

Ultimately, in terms of NPU performance, the Snapdragon X Elite excels with its 45 TOPS, surpassing the 38 TOPS of the Apple M4. However, the Apple M4 has a substantial advantage in both CPU and GPU performance. The Snapdragon X Elite approaches the Apple M3 but lags behind the M4 by a significant margin. Especially in terms of energy efficiency, Apple has demonstrated clear superiority over the competition. However, if we only consider the NPU aspect, the Snapdragon X Elite stands out as a true strong point.

Microsoft Copilot+ PCs: NPU Integration

Microsoft has recently introduced a new category of Windows PCs called Copilot+ PCs, specifically designed for artificial intelligence. These PCs feature new chips with incredible performance, capable of executing over 40 trillion operations per second (TOPS). They will initially ship with Qualcomm Snapdragon X Series chips, but Intel will also soon offer its Lunar Lake processors that, in addition to the 40 TOPS NPU, will also feature over 60 GPU TOPS, providing more than 100 platform TOPS. AMD is also working on new Strix chips with 50 TOPS neural processors, which will be released in July 2024.
Copilot+ PCs leverage Neural Processing Units (NPUs) to execute advanced AI models, offering innovative experiences such as information recall with Recall, real-time image generation with Cocreator, and audio translation in over 40 languages with Live Captions.

These devices feature a new system architecture that combines high-performance CPU, GPU, and NPU. Connected to advanced language models in the Azure cloud, Copilot+ PCs achieve unprecedented performance, with up to 20 times more power and 100 times more efficiency in handling AI workloads compared to traditional PCs. This allows powerful AI experiences to run directly on the device, removing previous limitations such as latency, cost, and privacy concerns.

Copilot+ PCs will be available from June 18, 2024, with prices starting at $999. Models include devices from Acer, ASUS, Dell, HP, Lenovo, Samsung, and Microsoft Surface, each designed to offer new AI experiences in a sleek, lightweight, and powerful design.

Monitoring the NPU from the Windows Task Manager

With the introduction of a new AI-dedicated chip, the need to monitor its usage has emerged. In notebooks equipped with Intel Core Ultra CPUs, the Windows Task Manager now includes a section dedicated to Intel AI Boost, allowing users to view NPU (Neural Processing Unit) activity in real-time. AMD announced that notebooks equipped with Ryzen 8040 series CPUs will offer a similar feature. In the future, it will be possible to monitor the Ryzen AI NPU, based on the XDNA architecture, directly from the Task Manager. This integration will provide users with a clear view of NPU performance and usage, facilitating AI workload management in systems that support this technology.

AI on Raspberry Pi 5

If you thought NPUs were only available for high-end computers, you’d be surprised to learn that the Raspberry Pi Foundation recently released, for $70, an AI Kit for the Raspberry Pi 5, consisting of a bundle with the Raspberry Pi M.2 HAT+ (initially developed to connect NVMe drives and other PCIe accessories) and a Hailo-8L M.2 AI acceleration module that incorporates an NPU capable of adding 13 TOPS (Tera Operations Per Second) of computing power to the small computer. Previously, only Google offered AI accelerator modules like the Google Coral, but only with 4 TOPS of power.
Although the AI Kit for Raspberry Pi 5 does not yet offer sufficient capacity to significantly speed up large language models run locally (LLMs with billions of parameters like LLaMA), it is ideal for light and specific AI applications in embedded computing. Besides supporting the main AI frameworks like TensorFlow, TensorFlow Lite, Keras, PyTorch, and ONNX, the Hailo-8L M.2 module is well suited to applications such as voice recognition and computer vision (object recognition, face detection, and real-time scene analysis) for simple surveillance, autonomous vehicles, and video analysis, while maintaining low power consumption.

For those who enjoy experimenting with unconventional solutions, some have tried a Frankenstein configuration that is not supported by official vendors. It combines a Raspberry Pi 5 with two Hailo NPUs (a Hailo-8L with 13 TOPS and a Hailo-8 with 26 TOPS), a Coral Dual Edge TPU (8 TOPS), and a Coral Edge TPU (4 TOPS), for a combined total of 51 TOPS. This configuration was assembled specifically to beat the headline figures on paper, surpassing the Copilot+ PC requirement of at least 40 TOPS, Qualcomm’s Snapdragon X with 45 TOPS, Apple’s M4 with 38 TOPS, Intel’s Lunar Lake with 48 TOPS, and AMD’s AI 300 series with 50 TOPS.
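The arithmetic behind that total is easy to check. The snippet below simply adds up the figures quoted above and compares the sum with the other numbers in the paragraph; the dictionary names are my own. Keep in mind that this is a nominal sum across separate accelerators, so it should not be read as the throughput a single AI model would actually see.

```python
# TOPS of each accelerator in the "Frankenstein" build, as quoted in the text.
accelerators = {
    "Hailo-8L": 13,
    "Hailo-8": 26,
    "Coral Dual Edge TPU": 8,
    "Coral Edge TPU": 4,
}

# Reference figures quoted in the text for comparison.
reference_tops = {
    "Copilot+ PC minimum": 40,
    "Qualcomm Snapdragon X": 45,
    "Apple M4": 38,
    "Intel Lunar Lake": 48,
    "AMD AI 300 series": 50,
}

total_tops = sum(accelerators.values())
surpassed = [name for name, value in reference_tops.items() if total_tops > value]

print(total_tops)                             # 51
print(len(surpassed) == len(reference_tops))  # True
```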

Conclusions

Neural Processing Units (NPUs) are redefining the landscape of computing, becoming essential to meet the growing demands of artificial intelligence applications. With their specialized architecture designed to accelerate machine learning and AI, NPUs significantly enhance efficiency and performance across various sectors.

For developers and businesses, integrating NPUs into systems offers several practical advantages:

Local and real-time execution: NPUs enable local execution of complex AI models, reducing dependence on cloud services and lowering latency for critical applications such as voice and facial recognition, medical diagnostics, and autonomous driving systems.
Resource optimization: While GPUs handle large volumes of data for general calculations and intensive graphics, NPUs free up GPU resources by focusing on specific and repetitive AI tasks, thereby improving overall operational efficiency.
Energy efficiency: NPUs are designed to operate at lower precisions, reducing energy consumption compared to traditional GPUs, which is particularly advantageous for mobile devices and edge applications where energy management is crucial.
Consumer and enterprise applications: For consumer devices, NPUs enable advanced features such as background blurring in video calls and computational photography. In enterprise contexts, they accelerate model training and real-time inference, opening new possibilities for industrial automation, data analysis, and security.

The growing adoption of NPUs is set to transform how we tackle AI computational challenges. To stay competitive, it is crucial for companies and IT professionals to understand how NPUs work and their practical applications, integrating them into their solutions to fully exploit the potential of artificial intelligence.