Why DRPs Excel at Implementing AI/ML Applications for OT Markets

By: Mario Morales, IDC

Many tasks on the edge and in endpoint devices—especially Deep Neural Network (DNN) and Convolutional Neural Network (CNN) AI/ML inferencing tasks—require an optimal blend of processing performance and power efficiency. To date, AI/ML workloads such as DNNs and CNNs have run almost exclusively on server-class MPUs and server-based GPU accelerators, even though server-class MPUs and GPUs are not particularly power-efficient at AI/ML processing. This lack of efficiency is the direct result of a server-class design philosophy that emphasizes MPU and GPU compute performance at any price rather than compute performance per watt. High performance is usually achieved through very high clock rates, which result in high power consumption. This performance-at-any-price design approach caters to typical server-class workloads and relies upon data-center power and cooling capabilities. It is not appropriate for edge or endpoint processing.

AI applications running on the edge and in endpoints—in products ranging from wearables to industrial and medical equipment to connected vehicles—demand more power-efficient processing. Server-class MPUs and GPUs simply draw far too much power to be used in these OT (Operational Technology) markets, which include embedded, mobile, and industrial IoT products. IDC believes that an emerging class of solutions collectively called Dynamically Reconfigurable Processors (DRPs) addresses most of the key requirements for running AI/ML algorithms in the OT space.

DRPs can deliver high performance with highly adaptive flexibility for a wide variety of target applications using a large, reconfigurable array of many processing elements (PEs). The DRP’s processing functions can be dynamically altered under software control to meet an application’s immediate and changing processing needs. A DRP’s PE array can be partitioned, grouped, and configured to run multiple algorithms simultaneously. DRPs have been the subject of substantial academic research for more than a decade, and recent announcements from multiple semiconductor vendors indicate that DRPs will become increasingly commonplace in many types of semiconductor devices, ranging from large FPGAs and ASICs at the high-power end of the spectrum all the way down to MCUs at the low-power end. This range covers the processing needs of a vast number of product types used in endpoints and on the edge.

Figure 1 shows a typical DRP. It consists of several PEs surrounded by local memory, which is used to store both raw input data and processed output data in close proximity to the PEs. Locating these data buffers next to the PE array reduces the energy needed to move data into and out of the array and reduces memory-to-PE latency. Both outcomes benefit overall system power consumption and performance.

Figure 1: A typical DRP consists of an array of Processing Elements (PEs), SRAMs, and a DMA controller with a programmable interconnect to enable on-the-fly reconfiguration.

A DMA controller integrated into the DRP brings raw data into the DRP from a host system and conveys processed data back to the host system. A typical DRP design also includes additional processing resources that the PEs can use. A programmable interconnect system integrated into the PE array enables hardware reconfiguration.
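To make this data flow concrete, the minimal C sketch below models the host-side sequence: load a configuration, DMA a tile of raw data into the local buffers, run the PE array, and DMA the results back out. The function names, buffer sizes, and the trivial processing kernel are illustrative assumptions, not a real vendor API.

```c
/* Sketch of a hypothetical host-side flow for offloading work to a DRP.
 * All driver calls (drp_load_config, drp_dma_in, drp_run, drp_dma_out)
 * are illustrative stand-ins, not a real vendor API. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TILE_WORDS 256                      /* size of one input tile (illustrative) */

static uint16_t drp_local_in[TILE_WORDS];   /* models SRAM next to the PE array */
static uint16_t drp_local_out[TILE_WORDS];

/* Stub: load a compiler-generated configuration into the PE array. */
static void drp_load_config(const char *config_name) {
    printf("DRP: interconnect and PEs configured for '%s'\n", config_name);
}

/* Stub: DMA controller copies a tile from host memory into local SRAM. */
static void drp_dma_in(const uint16_t *host_src, size_t words) {
    memcpy(drp_local_in, host_src, words * sizeof(uint16_t));
}

/* Stub: the configured PE array processes the tile (here: a trivial scale). */
static void drp_run(size_t words) {
    for (size_t i = 0; i < words; i++)
        drp_local_out[i] = (uint16_t)(drp_local_in[i] >> 1);
}

/* Stub: DMA controller returns processed data to host memory. */
static void drp_dma_out(uint16_t *host_dst, size_t words) {
    memcpy(host_dst, drp_local_out, words * sizeof(uint16_t));
}

int main(void) {
    uint16_t frame[TILE_WORDS], result[TILE_WORDS];
    for (size_t i = 0; i < TILE_WORDS; i++)
        frame[i] = (uint16_t)i;

    drp_load_config("conv3x3");      /* hypothetical CNN layer configuration */
    drp_dma_in(frame, TILE_WORDS);   /* raw data in via the integrated DMA   */
    drp_run(TILE_WORDS);             /* PE array executes on local buffers   */
    drp_dma_out(result, TILE_WORDS); /* processed data back to the host      */

    printf("first output word: %u\n", (unsigned)result[0]);
    return 0;
}
```

Keeping the working buffers in local SRAM, as the stubs suggest, is what lets the DMA transfers rather than the PEs absorb the cost of moving data on and off the chip.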

In some ways, a DRP resembles an FPGA. It integrates a tiled array of PEs connected through a dynamically reconfigurable interconnect, much as FPGAs incorporate Look-Up Tables (LUTs) and programmable interconnect. In both cases, the tiles deliver high processing performance through parallelism. However, FPGAs offer fine-grained parallelism with relatively simple LUTs, while DRPs offer coarse-grained parallelism using more complex PEs.

Each FPGA LUT is far less complex than a processor; LUTs are really just simple logic elements. A design team must somehow fashion large arrays of on-chip LUTs into complex processing blocks using an appropriate hardware compiler and other complex design tools. In contrast, DRPs have a coarse-grained architecture based on an array of PEs, and each PE is a processor in its own right. An appropriate software compiler can harness one or more of the DRP’s PEs to execute specific algorithms. The compiler can generate multiple, loadable DRP configurations that partition the DRP’s PE array into sub-arrays, each able to execute a different portion of the application code in parallel, as illustrated in Figure 2 below.
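To make the idea of compiler-generated configurations more concrete, the hypothetical C sketch below describes one loadable configuration as a table that assigns groups of PEs to different algorithms running side by side. The structure, field names, and kernel names are assumptions made for illustration; a real toolchain would define its own configuration format.

```c
/* Sketch of how a compiler-generated DRP image might describe sub-array
 * partitions. Field names and kernel names are illustrative assumptions. */
#include <stdio.h>

typedef struct {
    const char *kernel;   /* algorithm mapped onto this sub-array   */
    int first_pe, num_pe; /* contiguous group of PEs assigned to it */
} drp_partition_t;

/* One loadable configuration: three algorithms running in parallel. */
static const drp_partition_t config_preproc[] = {
    { "noise_filter",   0, 16 },
    { "color_convert", 16,  8 },
    { "conv_layer_1",  24, 40 },
};

int main(void) {
    size_t n = sizeof config_preproc / sizeof config_preproc[0];
    for (size_t i = 0; i < n; i++)
        printf("PEs %2d..%2d -> %s\n",
               config_preproc[i].first_pe,
               config_preproc[i].first_pe + config_preproc[i].num_pe - 1,
               config_preproc[i].kernel);
    return 0;
}
```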

Figure 2: The multiple PEs in a DRP can be configured to optimally execute different algorithms.

DRP sub-arrays can be dynamically reconfigured as the executing application moves from one section of active code to the next. It is therefore quite possible for executing application code to define and redefine the DRP’s processing architecture from moment to moment, as shown in Figure 2 above. The ability to configure and reconfigure the DRP on the fly effectively creates a virtually unlimited number of complex, hardware-accelerated processing functions in relatively small, low-cost systems. Something similar is possible with FPGA LUTs, but that design process is far more complicated.
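The following minimal C sketch, again using illustrative stand-in calls rather than a real API, shows how an application might swap pre-built configurations between pipeline stages so the same PE array serves pre-processing, inference, and post-processing in turn.

```c
/* Sketch of swapping DRP configurations between pipeline stages at runtime.
 * drp_load_config() and drp_run_tile() are illustrative stand-ins for
 * whatever reconfiguration and dispatch calls a real toolchain provides. */
#include <stdio.h>

static void drp_load_config(const char *name) {
    printf("reconfigure PE array -> %s\n", name);
}

static void drp_run_tile(int tile) {
    printf("  process tile %d with current configuration\n", tile);
}

int main(void) {
    /* Stage 1: the whole PE array runs image pre-processing. */
    drp_load_config("preprocess");
    for (int t = 0; t < 2; t++)
        drp_run_tile(t);

    /* Stage 2: the same silicon, reconfigured on the fly for CNN inference. */
    drp_load_config("cnn_inference");
    for (int t = 0; t < 2; t++)
        drp_run_tile(t);

    /* Stage 3: reconfigured again for post-processing and classification. */
    drp_load_config("postprocess");
    for (int t = 0; t < 2; t++)
        drp_run_tile(t);

    return 0;
}
```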

DRPs have been successfully used as reconfigurable accelerators for a variety of AI/ML tasks that require DNNs, such as image processing and image recognition. Because they can be configured and quickly reconfigured, DRPs can handle the processing requirements of both existing and yet-to-be-created neural networks. DRPs can be integrated into Application-Specific Standard Products (ASSPs) and Application-Specific Integrated Circuits (ASICs), and they can also be integrated into established MPU, MCU, and even FPGA product families.

IDC believes that currently unfolding OT market developments will soon drive embedded intelligence into billions of devices that require AI/ML inference processing under many additional constraints, including low power consumption, low latency, and real-time response. DRPs can viably address these requirements while still achieving good performance. DRPs can enable AI inference algorithms to run in real time on many endpoint systems, especially when integrated into broadly adopted, general-purpose processing chips, particularly MCUs, which are already commonly designed into these types of systems.

This blog article is part of a series and is based on the IDC White Paper titled "Embedded Artificial Intelligence: Reconfigurable Processing Accelerates AI in Endpoint Systems for the OT Market."