A crack in Nvidia's empire

Source: Silicon-based Institute

Author: He Luheng/Boss Dai

In 2012, two major events took place in the AI world. In chronological order, the first was the debut of Google Brain, the long-incubated Google team, which released a deep learning network that could recognize cats, nicknamed "Google Cat." Its 74.8% recognition accuracy was 0.8 percentage points higher than the 74% achieved by the previous year's winning algorithm in the well-known image recognition competition ImageNet.

But Google's moment in the spotlight lasted only a few months. In December 2012, the winner of the latest ImageNet competition was announced: deep learning pioneer Geoffrey Hinton and his students brought the convolutional neural network AlexNet, which raised recognition accuracy to 84% and kicked off the AI revolution of the following decade. Google Cat was buried in the dust of history.

Hinton with two students, 2012

It wasn't just the ImageNet result itself that shocked the industry. This neural network, which required 14 million images and a total of 262 petaflops of floating-point operations, used only two NVIDIA GeForce GTX 580s during a week of training. For reference, Google Cat used 10 million images, 16,000 CPUs, and 1,000 computers [1].

It is rumored that Google also secretly entered the competition that year, and the shock it received showed up directly in its next moves: Google spent $44 million to acquire Hinton's team and immediately placed orders with Nvidia for large numbers of GPUs for AI training. Microsoft, Facebook, and other giants were scrambling for GPUs at the same time.

**Nvidia became the biggest winner, with its stock price rising as much as 121-fold over the following decade. An empire was born.**

But two dark clouds gradually gathered over the empire. Google, which had bought so many Nvidia GPUs back then, made a stunning debut with AlphaGo three years later and defeated world champion Ke Jie in 2017. Sharp-eyed observers noticed that the chip powering AlphaGo was no longer Nvidia's GPU, but Google's self-developed TPU.

Three years later, a similar scene repeated itself. Tesla, once regarded by Huang Renxun (Jensen Huang) as a benchmark customer, also bid farewell to Nvidia's GPUs: it first launched the NPU-centered FSD chip for its vehicles, then unveiled the D1 chip for building AI training clusters. Nvidia had lost two of the most important customers of the AI era.

In 2022, the global IT cycle entered a downturn. Major cloud computing companies cut their data center GPU procurement budgets one after another, the tide of cryptocurrency mining gradually receded, and the U.S. chip ban made it impossible to sell high-end cards such as the A100/H100 to China. Nvidia's inventory surged, and its stock price fell by two-thirds from its peak.

At the end of 2022, ChatGPT was born, and GPUs, as the fuel for large-model "alchemy", were snapped up once again. Nvidia got a respite, but a third dark cloud followed: on April 18, 2023, the technology outlet The Information broke the news that *Microsoft, the instigator of this round of the AI wave, is secretly developing its own AI chip* [2].

The chip, codenamed Athena, is manufactured by TSMC on an advanced 5nm process, and Microsoft's R&D team for it numbers nearly 300 people. The goal is obvious: to replace the expensive A100/H100, provide a compute engine for OpenAI, and eventually grab a slice of Nvidia's cake through Microsoft's Azure cloud service.

Microsoft is currently the largest buyer of Nvidia's H100, and was even rumored to be "booking out" the H100's entire annual production capacity. This breakup signal from Microsoft is nothing short of a bolt from the blue. Keep in mind that even at Intel's darkest hour, none of its customers "dared" to make their own CPU chips (with the exception of Apple, whose chips are not sold externally).

Although Nvidia currently holds 90% of the AI computing power market with its GPU + NVLink + CUDA combination, **the first crack has appeared in the empire.**

01, The GPU that was not born for AI

From the very beginning, GPUs were not made for AI.

In October 1999, Nvidia released the GeForce 256, a graphics processing chip built on TSMC's 220nm process with 23 million transistors. Nvidia coined the abbreviation "GPU" from Graphics Processing Unit and billed the GeForce 256 as **"the world's first GPU"**, establishing the term that is still in use today.

At the time, artificial intelligence had been dormant for years, especially in the field of deep neural networks. Future Turing Award winners such as Geoffrey Hinton and Yann LeCun were still warming the academic bench, never imagining that their careers would be completely transformed by a chip originally developed for gamers.

What was the GPU born for, then? Graphics. More precisely, it was born to free the CPU from the drudgery of graphics display. The basic principle of rendering is to break each frame down into individual pixels and run them through multiple stages, such as vertex processing, primitive processing, rasterization, fragment processing, and pixel operations, before the result is finally displayed on screen.

The processing pipeline from pixels to an image; source: Graphics Compendium

Why call this drudgery? A simple back-of-the-envelope calculation:

Assume the screen has 300,000 pixels. At a frame rate of 60fps, 18 million pixel renderings must be completed every second, each involving the five steps above, or roughly five instructions apiece. That means the CPU would need to execute 90 million instructions per second just to keep the screen refreshed. For reference, Intel's highest-performing CPU at the time managed only about 60 million operations per second.
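A minimal sketch of that back-of-the-envelope estimate (the pixel count, frame rate, and one-instruction-per-step figures are the article's illustrative assumptions, not measured numbers):

```python
# Back-of-the-envelope estimate of the per-second rendering workload.
# All inputs are the article's illustrative assumptions.
pixels_per_frame = 300_000       # assumed screen resolution
frames_per_second = 60           # assumed refresh rate
instructions_per_pixel = 5       # one instruction per rendering step (assumed)

renderings_per_second = pixels_per_frame * frames_per_second
instructions_per_second = renderings_per_second * instructions_per_pixel

print(f"{renderings_per_second:,} pixel renderings/s")    # 18,000,000
print(f"{instructions_per_second:,} instructions/s")      # 90,000,000
```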

This is not because the CPU is weak, but because its strength is thread scheduling: more die area is devoted to control and cache, while the compute units take up only about 20% of the space. In a GPU, by contrast, more than 80% of the space is compute units, which gives it massive parallel computing capability and makes it better suited to the fixed-step, repetitive, tedious work of displaying images.

The internal structure of a CPU versus a GPU; the green areas are compute units

Only a few years later did some AI researchers realize that a chip with these characteristics was also well suited to training deep learning models. Many classic deep neural network architectures had been proposed as early as the second half of the 20th century, but for lack of hardware to train them, much of the research remained "on paper" and progress stagnated for a long time.

The starting gun fired in October 1999 eventually brought GPUs to artificial intelligence. Training a deep network means pushing each input through the functions and parameters of every layer of the neural network, layer by layer, until an output value is produced. Like graphics rendering, this boils down to a huge number of matrix operations, which happens to be exactly what GPUs do best.

A typical deep neural network architecture; source: towards data science
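To make the "layer-by-layer matrix operations" point concrete, here is a minimal forward-pass sketch in NumPy. The layer sizes, random weights, and ReLU nonlinearity are illustrative assumptions, not any particular network from the article:

```python
import numpy as np

# A toy forward pass: each layer is just a matrix multiply plus a nonlinearity.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 64))              # one input sample with 64 features

layer_sizes = [64, 128, 128, 10]              # assumed architecture
weights = [rng.standard_normal((m, n)) * 0.1
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

activation = x
for W in weights:
    activation = np.maximum(activation @ W, 0)   # matrix multiply + ReLU

print(activation.shape)   # (1, 10) -- the network's output values
```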

However, while graphics display involves enormous amounts of data, most of its steps are fixed. Once deep neural networks are applied to decision-making, they involve more complex situations such as branching structures, and the parameters of each layer must be revised continuously based on positive and negative feedback from massive amounts of data. These differences planted hidden risks for how well GPUs would adapt to AI later on.

Kumar Chellapilla, now a general manager of AI/ML at Amazon, was the first researcher to take the plunge with GPUs. In 2006, he implemented a convolutional neural network (CNN) on Nvidia's GeForce 7800 graphics card for the first time and found it ran 4 times faster than on a CPU. This is the earliest known attempt to use GPUs for deep learning [3].

Kumar Chellapilla and the Nvidia GeForce 7800

Chellapilla's work did not attract wide attention, mainly because programming the GPU was so complex at the time. But right then, in 2007, Nvidia launched the CUDA platform, which dramatically lowered the barrier for developers to train deep neural networks on GPUs and gave deep learning believers new hope.

Then, in 2009, Andrew Ng (Wu Enda) and colleagues at Stanford published a breakthrough paper [6]: with more than 70 times the computing power of a CPU, GPUs cut AI training time from weeks to hours. The paper pointed the way for the hardware implementation of artificial intelligence, and the GPU greatly accelerated AI's journey from paper to reality.

Andrew Ng (吴恩达)

It is worth noting that Andrew Ng joined Google Brain in 2011 and was one of the leaders of the Google Cat project mentioned at the beginning. Why Google Brain ultimately did not use GPUs is unknown to outsiders, but around the time Ng left Google for Baidu, rumors circulated that it was because Google's attitude toward GPUs was ambivalent.

**After countless people's explorations, the baton was finally handed to deep learning master Hinton, and by then the clock had already turned to 2012.**

In 2012, Hinton and two of his students, Alex Krizhevsky and Ilya Sutskever, designed a deep convolutional neural network, AlexNet, and planned to enter it in that year's ImageNet competition. The problem was that training AlexNet on a CPU could take several months, so they turned their attention to the GPU.

The GPU that proved pivotal in the history of deep learning was the famous "nuclear bomb graphics card", the GTX 580. As the flagship of Nvidia's then-latest Fermi architecture, the GTX 580 was stuffed with 512 CUDA cores (108 in the previous generation). While its computing power leapt forward, its extreme power consumption and heat output also earned Nvidia the nickname "Nuclear Bomb Factory".

One man's poison is another man's honey. Compared with the "smoothness" of training neural networks on GPUs, the heat problem was hardly worth mentioning. Hinton's team completed their programming on Nvidia's CUDA platform, and with the support of two GTX 580 graphics cards, training on 14 million images took only a week. AlexNet went on to win the championship.

**Thanks to the influence of the ImageNet competition and of Hinton himself, AI researchers everywhere realized the importance of the GPU almost overnight.**

Two years later, Google entered ImageNet with its GoogLeNet model and won with 93% accuracy, trained on NVIDIA GPUs. That year, the number of GPUs used across all participating teams soared to 110. Outside the competition, the GPU had become a "must-buy" for deep learning, sending Huang Renxun a steady stream of orders.

This helped Nvidia shake off the shadow of its fiasco in the mobile market. After the iPhone's release in 2007, the smartphone chip pie expanded rapidly, and Nvidia tried to grab a slice from Samsung, Qualcomm, and MediaTek, only to fail because of its heat problems. In the end it was artificial intelligence, itself rescued by the GPU, that handed Nvidia a second growth curve.

But the GPU, after all, was not born for training neural networks, and the faster artificial intelligence developed, the more these problems were exposed.

For example, although the GPU differs significantly from the CPU, both basically follow the von Neumann architecture, with storage and computation separated. For graphics processing, whose steps are relatively fixed, the efficiency bottleneck caused by this separation can be papered over with more parallelism; but in a neural network full of branching structures it is fatal.

Every additional layer or branch in a neural network means another memory access to store data for later backtracking, and the time spent on this is unavoidable. In the era of large models especially, the bigger the model, the more memory accesses it performs, and the energy consumed in memory access is many times higher than that consumed in computation.

A simple analogy: the GPU is a muscle-bound brute (with many compute units), but for every instruction it receives, it has to go back and consult the manual (memory). As model size and complexity grow, the time left for real work shrinks, and the brute wears itself out flipping through manuals.

Memory is just one of the many ways GPUs sit uneasily in deep neural network applications. Nvidia was aware of these problems from the start and quickly began "hacking" the GPU to make it better suited to AI scenarios; meanwhile, AI players who sensed which way the wind was blowing were sneaking in, trying to use the GPU's defects to pry open a corner of Huang Renxun's empire.

**A battle of offense and defense had begun.**

02, The covert war between Google and Nvidia

Facing overwhelming demand for AI computing power and the congenital defects of the GPU, Huang Renxun offered two sets of solutions that advanced hand in hand.

**The first was to keep brute-force stacking computing power, in the spirit of "the old wizard of compute, whose magic knows no bounds."** In an era when demand for AI compute doubled every 3.5 months, computing power was the carrot dangling in front of AI companies' eyes, making them curse Huang Renxun's masterful knife work on pricing even as they snapped up every bit of Nvidia's capacity.

**The second was to gradually resolve the mismatch between the GPU and AI scenarios through incremental innovation.** The problems include, but are not limited to, power consumption, the memory wall, bandwidth bottlenecks, low-precision computation, high-speed interconnects, and optimizations for specific models. Starting in 2012, Nvidia abruptly accelerated the pace of its architecture updates.

After releasing CUDA, Nvidia used a unified architecture to support its two major scenarios, graphics and computing. The first generation debuted in 2007 and was named Tesla, not because Huang Renxun wanted to flatter Musk, but as a tribute to the physicist Nikola Tesla (the very earliest generation had been the Curie architecture).

Since then, every generation of NVIDIA GPU architecture has been named after a famous scientist, as shown in the figure below. With each iteration, Nvidia has kept piling on computing power while making improvements that stop short of tearing up the design.

For example, the second-generation Fermi architecture (2011) suffered from heat problems, so the third-generation Kepler architecture (2012) shifted the overall design philosophy from high-performance to power-efficient to improve thermals; and to address the "muscle-bound brute" problem described above, the fourth-generation Maxwell architecture (2014) added more logic control circuitry for finer-grained control.

To adapt to AI workloads, Nvidia's "hacked" GPU has in some ways become more and more like a CPU: just as the CPU's excellent scheduling ability comes at the expense of raw compute, Nvidia has had to restrain itself in stacking compute cores. But however much you modify a GPU, the burden of general-purpose flexibility makes it hard to match a dedicated chip in AI scenarios.

**The first to strike at Nvidia was Google, which had also been the first to buy GPUs at scale for AI computing.**

After flexing its muscles with GoogLeNet in 2014, Google stopped publicly entering the image recognition competition and instead plotted AI-specific chips. In 2016, AlphaGo drew first blood: after it beat Lee Sedol (Li Shishi), Google immediately unveiled its self-developed AI chip, the TPU, catching Nvidia off guard with a brand-new architecture "born for AI".

TPU stands for Tensor Processing Unit. If Nvidia's GPU "hacks" amount to robbing Peter to pay Paul, the TPU fundamentally reduces the demand for storage and interconnect and shifts as much chip area as possible to computation. It does this with two key techniques:

**The first is quantization.** Modern computing usually works with high-precision data that takes up more memory, but in practice most neural network calculations do not need 32-bit or 16-bit floating-point precision. The essence of quantization is to approximate 32-bit/16-bit numbers with 8-bit integers, preserving adequate accuracy while slashing storage requirements.
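A minimal sketch of the idea, assuming a simple symmetric linear scheme (the scaling rule and the random weights are illustrative; this is not the TPU's actual quantization recipe):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric linear quantization of float32 values to int8 (illustrative)."""
    scale = np.abs(x).max() / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)   # illustrative float32 weights
q, scale = quantize_int8(weights)

print(q.nbytes, "bytes as int8 vs", weights.nbytes, "bytes as float32")  # 16 vs 64
print("max rounding error:", np.abs(dequantize(q, scale) - weights).max())
```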

**The second is the systolic array**, i.e., the matrix multiplication array, one of the most critical differences between the TPU and the GPU. Simply put, neural network operations require huge numbers of matrix operations. A GPU can only break a matrix calculation down into multiple vector calculations, step by step; each time it finishes a group, it has to access memory and save that layer's results, and only after all the vector calculations are done can it combine the per-layer results into the output value.

In the TPU, thousands of compute units are wired directly together into a matrix multiplication array that serves as the computational core and performs matrix calculations directly. Apart from loading data and functions at the start, there is no need to touch the storage units, which greatly reduces memory access frequency, greatly speeds up the TPU's computation, and sharply lowers both energy consumption and physical footprint.
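A toy functional model may help make the contrast concrete: the sketch below accumulates a matrix product inside the "array" over successive beats, touching memory only to load the operands and emit the final result. It is a behavioral sketch only; a real systolic array pipelines operands between neighboring processing elements in hardware.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Toy weight-stationary systolic array: W is conceptually preloaded into the
    PE grid, rows of A are streamed through, and partial sums accumulate in place."""
    n, k = A.shape
    k2, m = W.shape
    assert k == k2
    acc = np.zeros((n, m))                 # partial sums held inside the array
    for step in range(k):                  # one "beat" of the array per step
        # each PE column multiplies the streamed-in operand by its resident weight
        acc += np.outer(A[:, step], W[step, :])
    return acc                             # results leave the array only at the end

A = np.random.randn(8, 16)
W = np.random.randn(16, 4)
print(np.allclose(systolic_matmul(A, W), A @ W))   # True
```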

Comparison of memory access counts for CPU, GPU, and TPU

Google moved fast on the TPU: it took only 15 months from design through verification and mass production to deployment in its own data centers. In testing, the TPU's performance and power efficiency in CNN, LSTM, MLP, and other AI workloads far outstripped Nvidia's contemporaneous GPUs. **All the pressure was suddenly dumped on Nvidia.**

Being stabbed in the back by a big customer hurts, but Nvidia was not going to stand there and take the blows, and a tug-of-war began.

Five months after Google unveiled the TPU, Nvidia introduced the Pascal architecture on a 16nm process. On the one hand, the new architecture brought the famous NVLink high-speed bidirectional interconnect, greatly increasing connection bandwidth; on the other, it imitated the TPU's quantization approach, improving neural network computing efficiency by reducing data precision.

In 2017, Nvidia launched Volta, its first architecture designed specifically for deep learning, which introduced the Tensor Core, a unit dedicated to matrix operations. Its 4×4 multiply array looks a little shabby next to the TPU's 256×256 systolic array, but it was a compromise struck while preserving flexibility and generality.

A 4×4 matrix operation implemented by a Tensor Core in the Nvidia V100
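As a rough functional model, the Tensor Core operation is commonly described as a fused tile computation D = A·B + C with half-precision inputs and single-precision accumulation. The NumPy sketch below mimics that behavior only; it is not how the hardware is actually programmed:

```python
import numpy as np

def tensor_core_tile(A, B, C):
    """Functional model of one Tensor Core op: D = A @ B + C on 4x4 tiles,
    with FP16 inputs and FP32 accumulation. Behavioral sketch only."""
    A16 = A.astype(np.float16)            # inputs stored at half precision
    B16 = B.astype(np.float16)
    return A16.astype(np.float32) @ B16.astype(np.float32) + C.astype(np.float32)

A = np.random.randn(4, 4)
B = np.random.randn(4, 4)
C = np.zeros((4, 4))
print(tensor_core_tile(A, B, C).shape)    # (4, 4) -- one fused multiply-accumulate tile
```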

NVIDIA executives declared to customers: **"Volta is not an upgrade of Pascal, but a brand-new architecture."**

Google raced against the clock as well. In the five years after 2016, the TPU went through three more generations: TPU v2 in 2017, TPU v3 in 2018, and TPU v4 in 2021, and Google rubbed the numbers in Nvidia's face [4]: **TPU v4 is 1.2-1.7 times faster than Nvidia's A100 while consuming 1.3-1.9 times less power.**

Google does not sell TPU chips externally and continues to buy Nvidia GPUs in bulk, which keeps the AI chip rivalry between the two a "cold war" rather than open competition. But Google does deploy TPUs in its own cloud to sell AI computing power as a service, which undeniably shrinks Nvidia's potential market.

Google CEO Sundar Pichai demonstrates the TPU v4

While the two sparred in the shadows, artificial intelligence itself was advancing at breakneck speed. In 2017, Google proposed the revolutionary Transformer model, and OpenAI then built GPT-1 on top of it. The large-model arms race broke out, and demand for AI compute entered its second acceleration since AlexNet appeared in 2012.

Sensing the new trend, Nvidia launched the Hopper architecture in 2022, introducing a hardware-level Transformer acceleration engine for the first time and claiming it could speed up the training of Transformer-based large language models by 9 times. On the Hopper architecture, Nvidia built the "most powerful GPU on the planet", the H100.

The H100 is Nvidia's ultimate "Frankenstein". On one hand it incorporates all manner of AI optimizations, such as quantization, matrix computation (Tensor Core 4.0), and the Transformer acceleration engine; on the other it is stacked with Nvidia's traditional strengths, such as 7,296 CUDA cores, 80GB of HBM2 memory, and NVLink 4.0 interconnect at up to 900GB/s.

With the H100 in hand, Nvidia could breathe a temporary sigh of relief; there is no mass-produced chip on the market that beats it.

The secret tug-of-war between Google and Nvidia has also been a mutual achievement: Nvidia has absorbed plenty of innovation from Google, and Google's cutting-edge AI research has benefited fully from the advances in Nvidia's GPUs. Together they have brought the cost of AI computing power down to a level that a large language model can reach "on tiptoe". Those in the limelight, such as OpenAI, are standing on the shoulders of both.

But sentiment is sentiment and business is business. The battle over the GPU has made one thing clearer to the industry: **the GPU is not the optimal solution for AI, and custom ASICs have a real chance of breaking Nvidia's monopoly.** The crack has been opened, and Google will not be the only one to follow the scent.

**Especially now that computing power has become the surest demand of the AGI era, everyone wants a seat at NVIDIA's dinner table.**

03, A crack that keeps widening

Besides OpenAI, two other companies have broken out in this round of the AI boom. One is the AI image company Midjourney, whose command of every painting style terrifies countless carbon-based artists; the other is Anthropic, founded by OpenAI alumni, whose chatbot Claude trades blows with ChatGPT.

**Yet neither of them bought Nvidia GPUs to build their own supercomputers; both use Google's computing services.**

To meet the explosion in AI compute demand, Google built a supercomputer, the TPU v4 Pod, from 4,096 TPUs interconnected with its self-developed optical circuit switches (OCS). It is used not only to train Google's own large language models such as LaMDA, MUM, and PaLM, but also to offer cheap, high-quality compute to AI startups.

Google's TPU v4 Pod supercomputer

Tesla, too, builds its own supercomputers. After launching the in-vehicle FSD chip, Tesla unveiled the Dojo ExaPOD supercomputer, built from 3,000 of its own D1 chips, in August 2021. The D1 is manufactured by TSMC on a 7nm process, and those 3,000 chips made Dojo the world's fifth most powerful computer by compute.

**Yet even the two combined cannot compare to the impact of Microsoft's self-developed Athena chip.**

Microsoft is one of Nvidia's largest customers: its Azure cloud service has purchased at least tens of thousands of high-end A100 and H100 GPUs, which in the future will have to support ChatGPT as well as SwiftKey and other products that use AI.

Do the math carefully and the "Nvidia tax" Microsoft has to pay is astronomical, making self-developed chips all but inevitable. It is just like when Alibaba projected Taobao and Tmall's future demand for cloud computing, databases, and storage, found the figure astronomical, and decisively threw its weight behind Alibaba Cloud while launching a sweeping internal "de-IOE" campaign (removing IBM, Oracle, and EMC).

**Saving costs is one side of it; vertical integration to create differentiation is the other.** In the smartphone era, Samsung made and sold its own CPUs (APs), memory, and screens, which contributed mightily to its global Android dominance. Google's and Microsoft's chipmaking likewise optimizes at the silicon level for their own cloud services in order to stand apart.

So the difference from Apple and Samsung, which keep their chips entirely in-house, is this: although Google's and Microsoft's AI chips will not be sold externally either, they will absorb some of Nvidia's potential customers through "AI compute as a cloud service". Midjourney and Anthropic are cases in point, and more small companies (especially at the AI application layer) are choosing cloud services.

**The global cloud computing market is highly concentrated: the top five providers (Amazon AWS, Microsoft Azure, Google Cloud, Alibaba Cloud, and IBM) hold more than 60% of it, and all of them are building their own AI chips. Of the five, Google is moving fastest, IBM has the deepest reserves, Microsoft will have the biggest impact, Amazon keeps the tightest secrecy, and Alibaba faces the most difficulties.**

For China's domestic giants developing their own chips, the fate of Oppo's chip unit Zheku (shut down in 2023) casts a shadow over every player entering the field. Overseas, however, big companies doing in-house chips can buy the talent and technology supply chain with money: when Tesla built the FSD chip, it recruited Silicon Valley legend Jim Keller; when Google developed the TPU, it brought in Turing Award winner and RISC architecture pioneer David Patterson.

Beyond the giants, some small and medium-sized companies are also trying to carve off a piece of Nvidia's cake, such as Graphcore, once valued at $2.8 billion; China's Cambricon falls into this category as well. The table below lists the better-known AI chip design startups around the world.

The difficulty for AI chip startups is that, without the sustained investment that deep-pocketed giants can provide, they cannot produce for their own use the way Google does. Unless their technical route is unique or their advantages overwhelming, they have essentially no chance in a head-on fight with Nvidia, whose cost and ecosystem advantages can smooth away nearly every customer doubt.

**Startups' impact on Nvidia is limited; Huang Renxun's real worry remains those big customers with wandering eyes.**

Of course, the major players still cannot do without Nvidia. Even though Google's TPU has reached its fourth generation, it still buys GPUs in large quantities to provide compute alongside the TPU; Tesla, its Dojo supercomputer notwithstanding, still chose to purchase 10,000 GPUs from NVIDIA.

Yet Huang Renxun has already tasted the fickle friendship of big customers, courtesy of Musk. In 2018, Musk publicly announced that Tesla would develop its own autonomous-driving chip (it was using Nvidia's DRIVE PX at the time); Huang Renxun was grilled by analysts on an earnings call and left momentarily unable to save face. Musk later issued a "clarification", but a year later Tesla left Nvidia without looking back [5].

The big players have never shown mercy when it comes to cutting costs. In the PC era, although Intel's chips were sold to businesses, consumers had strong autonomy in their choices, and PC makers had to advertise "Intel Inside"; in the era of compute-as-a-cloud-service, the giants can hide every detail of the underlying hardware. When they buy 100 TFLOPS of computing power in the future, will consumers be able to tell which part comes from a TPU and which from a GPU?

So Nvidia must finally face the question: **the GPU was indeed not born for AI, but will the GPU be AI's optimal solution?**

Over the past 17 years, Huang Renxun has pulled the GPU out of its single niche of gaming and image processing and turned it into a general-purpose computing tool, continually "hacking" it to fit new scenarios and searching for a balance between "general" and "specialized".

Over the past two decades, Nvidia has introduced countless technologies that changed the industry: the CUDA platform, Tensor Cores, RT Cores (ray tracing), NVLink, the cuLitho computational lithography platform, mixed precision, Omniverse, the Transformer engine... These technologies have taken Nvidia from a second-tier chip company to the industry's number one by market value, a story that is nothing if not inspiring.

But every era should have its own computing architecture. Artificial intelligence is advancing by leaps and bounds, with breakthroughs measured in hours. For AI to penetrate human life the way the PC and the smartphone did, computing costs may need to fall by 99%, and the GPU may indeed not be the only answer.

**History tells us that no matter how prosperous an empire is, it may still need to watch out for that inconspicuous crack.**

References

[1] ImageNet Classification with Deep Convolutional Neural Networks, Hinton

[2] Microsoft Readies AI Chip as Machine Learning Costs Surge, The Information

[3] High Performance Convolutional Neural Networks for Document Processing

[4] Google’s Cloud TPU v4 provides exaFLOPS-scale ML with industry-leading efficiency

[5] Tesla's AI ambitions, Tokawa Research Institute

[6] Large-scale Deep Unsupervised Learning using Graphics Processors
