As recently as two years ago, AI/ML workloads ran almost exclusively on server-class MPUs and server-based GPU accelerators, even though server-class MPUs and GPUs are not very power-efficient when it comes to neural-network (NN) processing. The lack of efficiency results from a design philosophy that emphasizes raw MPU and GPU compute performance, achieved through very high clock rates, rather than compute performance per watt. This design approach caters to typical server workloads and depends on data-center power and cooling capabilities.
Many tasks on the edge and in endpoint devices, especially Deep Neural Network (DNN) and Convolutional Neural Network (CNN) AI/ML inferencing tasks, require an optimal blend of processing performance and power efficiency. Server-class processors, designed to deliver compute performance at any price rather than compute performance per watt, are poorly matched to these power- and thermally-constrained deployments.