A brand-new fine-tuning performance benchmark for BridgeTower, a Vision-Language (VL) AI model, has shown that there is life in the AI acceleration camp beyond Nvidia's green. While Nvidia does dominate the AI acceleration market (through exceptional foresight, a well-thought-out and well-documented software stack, and raw processing performance), other players are keen to take a piece of the AI market for themselves. And at least for BridgeTower, Intel's own Gaudi 2 silicon (designed by Habana, the company Intel acquired for $2 billion in 2019) has been shown by Hugging Face to outperform Nvidia's A100 80 GB by a staggering 2.5x, and it even beats Nvidia's prodigy-child H100 by 1.4x.
Vision-Language
Vision-Language (VL) refers to AI models that can process and associate information across the modalities of language and visual representation. VL models are most commonly associated with image generation, through OpenAI's CLIP and models such as Stable Diffusion XL, a fast-growing market currently led by Midjourney, Stable Diffusion, and now Ideogram.
According to Habana, the momentous speedups are the result of a hardware-accelerated data-loading system. Data loading is one of the bottlenecks of AI model fine-tuning, and especially so for VL models; loading a workload into memory is often a performance bottleneck wherever computing is involved, so it isn't all that out of left field that Habana would look to optimize this particular step of the training process.
The main bottleneck relates to how CPUs get hammered with many expensive operations such as image decoding and image augmentation (a similar issue to the GPU draw-call debate), which leads the HPU (or Nvidia GPU) to stall while it waits for further data to be processed by the CPU and then sent over to the AI accelerator of choice. This is how the process goes without any hardware acceleration (a minimal illustrative sketch of such a CPU-bound pipeline follows the list):
- Fetch data (e.g. where your JPEG images are stored on disk)
- The CPU reads encoded images
- The CPU decodes images
- The CPU applies image transformations to augment images
- Images are sent to devices (although this is usually not done by the dataloader itself)
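For illustration, here is a minimal sketch of what such a CPU-bound pipeline typically looks like in PyTorch; the file paths, batch size, and transform choices are placeholders rather than anything taken from Habana's benchmark.

```python
# Illustrative CPU-bound image-loading pipeline (not Habana's benchmark code).
# Every step below (read, decode, augment) runs on the CPU; the accelerator only
# ever sees ready-made tensors handed over at the end.
from glob import glob

from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms


class JpegFolderDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths
        # CPU-side augmentations applied to every decoded image
        self.transform = transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        image = Image.open(self.paths[idx]).convert("RGB")  # CPU: read + decode
        return self.transform(image)                        # CPU: augment


# num_workers is the number of dedicated CPU processes feeding the accelerator;
# 0 means data is loaded inside the main training process.
loader = DataLoader(JpegFolderDataset(glob("data/*.jpg")), batch_size=64, num_workers=2)
for batch in loader:
    batch = batch.to("cuda")  # or "hpu" on Gaudi: the device receives decoded tensors
```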
And this is the process with Gaudi 2's built-in hardware acceleration, which moves image decoding and transformation onto the accelerator itself:
- Fetch data
- The CPU reads encoded images
- Encoded images are sent to devices
- Devices decode images
- Devices apply image transformations to augment images
With the hardware-accelerated approach, it becomes clear that the CPU is far less heavily leveraged (freeing up CPU cycles for other tasks during the main fine-tuning process), which should translate into improved performance.
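On Gaudi 2 this device-side path is exposed through Habana's media pipeline (the mediapipe_dataloader option in the table below). As a rough conceptual analogue rather than Habana's actual API, the sketch below uses torchvision's GPU JPEG decoding on Nvidia hardware to show the same idea: the CPU only reads raw bytes, while decoding and augmentation run on the device. The file path is a placeholder.

```python
# Conceptual analogue of device-side image loading (torchvision's nvJPEG path on
# an Nvidia GPU, NOT Habana's media pipeline API): the CPU reads encoded bytes,
# and both JPEG decoding and augmentation happen on the accelerator.
from torchvision.io import decode_jpeg, read_file
from torchvision.transforms import v2

raw_bytes = read_file("data/example.jpg")      # CPU: read encoded bytes only
image = decode_jpeg(raw_bytes, device="cuda")  # device: JPEG decode on the GPU

augment = v2.Compose([
    v2.RandomResizedCrop(224, antialias=True),
    v2.RandomHorizontalFlip(),
])
image = augment(image)                         # device: transforms on GPU tensors
```

The hardware specifics differ (Gaudi 2 has its own media-handling path), but the division of labor between CPU and device is the same as in the list above.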
Benchmarking Habana's Gaudi 2 by fine-tuning a pre-trained BridgeTower checkpoint with 866M parameters lets us see the performance gains that hardware-accelerated image loading brings to the table. The workloads were run distributed across 8 devices of each kind (Nvidia's A100 80 GB, H100, and Intel's Gaudi 2). The results were measured and averaged across three different runs per device, with each run dedicating an increasing number of CPU processes entirely to loading data into memory (the first run loads data within the main CPU process, while runs two and three raise the number of dedicated data-loading processes to one and two, respectively). A fourth configuration, exclusive to Gaudi 2, keeps two dedicated processes and additionally enables Habana's hardware-accelerated mediapipe_dataloader.
Throughput in samples per second (higher is better):

| Device | dataloader_num_workers=0 | dataloader_num_workers=1 | dataloader_num_workers=2 | dataloader_num_workers=2 + mediapipe_dataloader |
| --- | --- | --- | --- | --- |
| Gaudi 2 HPU | 601.5 | 747.4 | 768.7 | 847.7 |
| H100 GPU | 336.5 | 580.1 | 602.1 | N/A |
| A100 80 GB GPU | 227.5 | 339.7 | 345.4 | N/A |
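The dataloader_num_workers columns map directly onto a standard knob in Hugging Face's Trainer API, which is what makes the comparison straightforward to reproduce. The sketch below shows how that knob is set; the batch size and output path are placeholders, the checkpoint name assumes the publicly available BridgeTower large model, and the Gaudi runs additionally go through Optimum Habana's Gaudi-specific trainer rather than the stock one.

```python
# Minimal sketch of the dataloader_num_workers knob from the table above.
# 0 loads data in the main training process; 1 or 2 spawn that many dedicated
# CPU worker processes for data loading.
from transformers import BridgeTowerForContrastiveLearning, TrainingArguments

model = BridgeTowerForContrastiveLearning.from_pretrained(
    "BridgeTower/bridgetower-large-itm-mlm-itc"  # assumed 866M-parameter checkpoint
)

args = TrainingArguments(
    output_dir="bridgetower-finetune",   # placeholder output path
    per_device_train_batch_size=48,      # illustrative value, not the benchmark's
    dataloader_num_workers=2,            # 0, 1 or 2 in the benchmark runs
)
# `args` and `model` would then be handed to a Trainer (or Optimum Habana's
# GaudiTrainer on Gaudi 2) together with an image-text dataset prepared elsewhere.
```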
The results are clear: the scenario where Gaudi 2's relative advantage is largest is the first one, where data is loaded alongside the main training process, with Gaudi 2 besting even Nvidia's H100 by 1.79x, and the A100 by 2.64x. But this is a non-optimized scenario, as Habana itself admits; so perhaps the most revealing results come from the third data point, where two extra processes were spawned to handle data loading outside of the main fine-tuning process. There, Nvidia's products certainly have to squint to catch Gaudi 2's dust cloud as it runs off into the distance: Gaudi 2 delivers a 1.3x performance advantage over Nvidia's cream-of-the-crop H100, and a 2.23x improvement over the A100 80 GB.
It would be possible to spawn even more processes to handle data loading; but as the performance trend shows, that strategy runs into increasingly diminishing returns. On the Nvidia H100, for instance, performance improves by 1.72x when a single dedicated data-loading process is spawned, but going from one process to two only brings a further improvement of roughly 4%. Thanks to Habana's ability to move most data-loading steps onto Gaudi 2 itself, however, the company can unlock an additional 10% performance improvement over its own best score (where data loading and transformations are handled by two CPU processes).
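All of the speedup figures quoted here fall straight out of the throughput table; as a quick sanity check:

```python
# Deriving the quoted speedups from the throughput figures in the table above.
gaudi2 = {0: 601.5, 1: 747.4, 2: 768.7, "mediapipe": 847.7}
h100 = {0: 336.5, 1: 580.1, 2: 602.1}
a100 = {0: 227.5, 1: 339.7, 2: 345.4}

print(gaudi2[0] / h100[0])              # ~1.79x vs H100, no dedicated workers
print(gaudi2[2] / a100[2])              # ~2.23x vs A100, two dedicated workers
print(h100[1] / h100[0])                # ~1.72x from H100's first dedicated worker
print(h100[2] / h100[1])                # ~1.04x from its second (diminishing returns)
print(gaudi2["mediapipe"] / gaudi2[2])  # ~1.10x from Gaudi 2's mediapipe loader
```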
There's still a long way to go before any company can claim hegemony in the AI-acceleration space. Nvidia has an incredible product and software stack that has allowed it to seize the first-mover advantage; but we've seen enough races where the underdogs catch up to (and sometimes even surpass) the favorites to know that Intel, AMD, and others are all looking to steal Nvidia's thunder.