

## **Energy-Efficient Heterogeneous Design**

Perugia

Luca Benini<sup>1,2</sup>

#### 04.09.2019



<sup>1</sup>Department of Electrical, Electronic and Information Engineering



European Commission Horizon 2020 European Union funding for Research & Innovation



#### Cloud $\rightarrow$ Edge $\rightarrow$ Extreme Edge








#### **Energy efficiency is THE Challenge**





Cool... But, HOW??


#### 2013: Parallel Ultra Low Power $\rightarrow$ PULP!



Near-Threshold Computing (NTC):

1. Don't waste energy pushing devices into strong inversion
2. Recover performance with parallel execution
3. Manage leakage, PVT variability and SRAM limitations NT!!!



#### **Near-Threshold Multiprocessing**



Need a strong ISA, full access to "deep" core interfaces, and a tunable pipeline! Open ISA: RISC-V RV32IMC + new, open microarchitecture $\rightarrow$ RI5CY!



D. Rossi *et al.*, "Energy-Efficient Near-Threshold Parallel Computing: The PULPv2 Cluster," in *IEEE Micro*, Sep./Oct. 2017.

#### **Bespoke ISA needed! Enter Xpulp extensions**

<32-bit precision  $\rightarrow$  SIMD2/4  $\rightarrow$  x2,4 efficiency & memory size

RISC-V ISA is extensible *by construction* (great!)

- V1: Baseline RISC-V RV32IMC, HW loops
- V2: Post-modified load/store, MAC
- V3: SIMD 2/4 + dot product + shuffling, bit manipulation unit, lightweight fixed point (EML centric)
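To make the "SIMD 4 + dot product" idea concrete, here is a minimal Python sketch that emulates a 4-way packed 8-bit signed dot product with accumulation. The names `pack4_i8` and `sdot4_i8` are hypothetical and only model the arithmetic; they are not the Xpulp instruction encodings or compiler intrinsics.

```python
def pack4_i8(values):
    """Pack four signed 8-bit integers into one 32-bit word (element 0 in the low byte)."""
    assert len(values) == 4
    word = 0
    for i, v in enumerate(values):
        assert -128 <= v <= 127
        word |= (v & 0xFF) << (8 * i)
    return word


def _lane_i8(word, i):
    """Extract lane i of a packed word and sign-extend it."""
    byte = (word >> (8 * i)) & 0xFF
    return byte - 256 if byte >= 128 else byte


def sdot4_i8(acc, a_word, b_word):
    """acc + sum of four 8-bit products: one operation replaces four scalar MACs."""
    return acc + sum(_lane_i8(a_word, i) * _lane_i8(b_word, i) for i in range(4))


a = pack4_i8([1, -2, 3, -4])
b = pack4_i8([5, 6, -7, 8])
print(sdot4_i8(10, a, b))  # 10 + (5 - 12 - 21 - 32) = -50
```

Packing four operands per register is also where the "x4 memory size" benefit comes from: the same 32-bit word carries four values instead of one.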




#### 25 kGE → 40 kGE (1.6x)



M. Gautschi et al., "Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices," in IEEE TVLSI, Oct. 2017.

#### **RI5CY – are xPULP ISA Extensions (1.6x) worthwhile?**



#### **Results: RV32IMCXpulp vs RV32IMC**



PULP-NN: an open-source library for DNN inference on PULP cores



#### The Evolution of the 'Species'





#### Enter Zero/Micro-riscy, small core for control



- Only 2-stage pipeline, simplified register file
- Zero-Riscy (RV32-ICM), 19kGE, 2.44 Coremark/MHz
- Micro-Riscy (RV32-EC), 12kGE, 0.91 Coremark/MHz
- Used as SoC level controller in newer PULP systems



#### Different cores for different types of workload



#### The IoT Processor: Mr Wolf





#### Mr. Wolf Chip Results: Heterogeneous Computing Works

| Technology         | CMOS 40nm LP       |
|--------------------|--------------------|
| Chip area          | 10 mm <sup>2</sup> |
| VDD range          | 0.8V - 1.1V        |
| Memory Transistors | 576 Kbytes         |
| Logic Transistors  | 1.8 Mgates         |
| Frequency Range    | 32 kHz – 450 MHz   |
| Power Range        | 72 μW – 153 mW     |

| Power Management<br>(DC/DC + LDO) | VDD [V]   | Freq.            | Power         |
|-----------------------------------|-----------|------------------|---------------|
| Deep Sleep                        | 0.8       | n.a.             | 72 µW         |
| Ret. Deep Sleep                   | 0.8       | n.a.             | 76.5 – 108 µW |
| SoC Active                        | 0.8 - 1.1 | 32 kHz – 450 MHz | 0.97 – 38 mW  |
| Cluster Active                    | 0.8 - 1.1 | 32 kHz – 350 MHz | 1.6 – 153 mW  |



A. Pullini, D. Rossi, I. Loi, A. Di Mauro, L. Benini, "Mr.Wolf: a 1 GFLOP/S Energy-Proportional Parallel Ultra Low Power SoC for IoT Edge Processing", ESSCIRC 2018.



#### More efficiency: Heterogeneous PULP Cluster




#### **HW Convolution Engine**





F. Conti and L. Benini, "A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters," *Design, Automation & Test in Europe Conference & Exhibition (DATE)*, 2015, pp. 683-688.

#### **HWCE Sum-of-Products**



#### **Heterogeneous PULP CNN Performance**



Now coming: HWCE 2.0 – improves scalability & flexibility @ 3TOPS/W



## PULP cluster + MCU + HWCE (V1) → GWT's GAP8 (TSMC 55nm)

Two independent clock and voltage domains, from 0-133MHz/1V up to 0-250MHz/1.2V



| What                | Freq MHz | Exec Time ms | Cycles     | Power mW |
|---------------------|----------|--------------|------------|----------|
| 40nm Dual Issue MCU | 216      | 99.1         | 21 400 000 | 60       |
| GAP8 @1.0V          | 15.4     | 99.1         | 1 500 000  | 3.7      |
| GAP8 @1.2V          | 175      | 8.7          | 1 500 000  | 70       |
| GAP8 @1.0V w HWCE   | 4.7      | 99.1         | 460 000    | 0.8      |

Slide annotations recovered from the figure overlay: "16" next to the power column and "11 X" next to the execution-time column (60/3.7 ≈ 16x lower power, 99.1/8.7 ≈ 11x faster).

**4x More efficiency at less than 10% area cost** 
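The efficiency claim can be checked from the table by multiplying power and execution time. The short sketch below does that; the figures are copied from the table, while the variable and function names are ours.

```python
# (label, power in mW, execution time in ms) copied from the GAP8 table
rows = [
    ("40nm Dual Issue MCU", 60.0, 99.1),
    ("GAP8 @1.0V", 3.7, 99.1),
    ("GAP8 @1.2V", 70.0, 8.7),
    ("GAP8 @1.0V w HWCE", 0.8, 99.1),
]


def energy_uj(power_mw, time_ms):
    """Energy per run in microjoules (mW * ms = uJ)."""
    return power_mw * time_ms


energies = {label: energy_uj(p, t) for label, p, t in rows}
for label, e in energies.items():
    print(f"{label:22s} {e:8.1f} uJ")

# Same 99.1 ms deadline, software-only vs. HWCE-assisted execution
hwce_gain = energies["GAP8 @1.0V"] / energies["GAP8 @1.0V w HWCE"]
print(f"HWCE gain over software on GAP8 @1.0V: {hwce_gain:.1f}x")
```

With both runs meeting the same 99.1 ms deadline, the energy ratio reduces to the power ratio 3.7/0.8, which is consistent with the "4x more efficiency" headline.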




#### **New Application Frontiers: DroNET on NanoDrone**



Only onboard computation for autonomous flight + obstacle avoidance: no human operator, no ad-hoc external signals, and no remote base-station!

#### More Efficiency (2): Extreme Quantization

| Model         | Bit-width   | Top-1 error | SOA INQ retraining                 |
|---------------|-------------|-------------|------------------------------------|
| ResNet-18 ref | 32          | 31.73%      |                                    |
| INQ           | 5           | 31.02%      |                                    |
| INQ           | 4           | 31.11%      |                                    |
| INQ           | 3           | 31.92%      |                                    |
| INQ           | 2 (ternary) | 33.98%      | 2.2% loss $\rightarrow$ 0% with 20 |

Low(er) precision:  $8 \rightarrow 4 \rightarrow 2$ 





1 MAC Op = 2 Op (1 Op for the "sign-reverse", 1 Op for the add).




#### From +/-1 Binarization to XNORs

$$\mathbf{y}(k_{out}) = \text{binarize}_{\pm 1} \left( \mathbf{b}_{k_{out}} + \sum_{k_{in}} \left( \mathbf{W}(k_{out}, k_{in}) \otimes \mathbf{x}(k_{in}) \right) \right)$$
  

$$\text{binarize}_{\pm 1}(t) = \text{sign} \left( \gamma \frac{t - \mu}{\sigma} + \beta \right)$$
  

$$\text{binarize}_{0,1}(t) = \begin{cases} 1 \text{ if } t \ge -\kappa/\lambda \doteq \tau, \text{ else } 0 \quad (\text{when } \lambda > 0) \\ 1 \text{ if } t \le -\kappa/\lambda \doteq \tau, \text{ else } 0 \quad (\text{when } \lambda < 0) \end{cases}$$
  

$$\mathbf{y}(k_{out}) = \text{binarize}_{0,1} \left( \sum_{k_{in}} \left( \mathbf{W}(k_{out}, k_{in}) \otimes \mathbf{x}(k_{in}) \right) \right)$$
  
Thresholding  
Multi-bit accumulation
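A minimal software sketch of the two steps named here (multi-bit accumulation, then thresholding) for a single output, assuming the common encoding where bit 1 stands for +1 and bit 0 for -1. The function names and the example vectors are illustrative; this is not the XNE datapath.

```python
def binary_dot(w_bits, x_bits, n):
    """Dot product of two +/-1 vectors of length n stored as bit masks
    (bit = 1 means +1, bit = 0 means -1)."""
    mask = (1 << n) - 1
    agree = ~(w_bits ^ x_bits) & mask   # XNOR: 1 where the signs match
    matches = bin(agree).count("1")     # popcount = multi-bit accumulation
    return 2 * matches - n              # +1 per match, -1 per mismatch


def binarize01(t, tau, lam_positive=True):
    """Thresholding: compare the accumulator against tau, as in binarize_{0,1}."""
    return int(t >= tau) if lam_positive else int(t <= tau)


def encode(v):
    """Encode a list of +/-1 values as a bit mask (element i goes to bit i)."""
    return sum(1 << i for i, s in enumerate(v) if s == 1)


w = [1, -1, 1, 1, -1, -1, 1, -1]
x = [1, 1, -1, 1, -1, 1, 1, -1]
t = binary_dot(encode(w), encode(x), len(w))
assert t == sum(a * b for a, b in zip(w, x))  # matches the plain +/-1 dot product
print(t, binarize01(t, tau=0))
```

The point of the rewrite is that batch normalization and the sign function collapse into one integer comparison, so no multiplier is needed anywhere in the layer.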




#### **XNE: XNOR Neural Engine**



Main unit: binary dot-product and thresholding



#### **Quentin: a XNE-accelerated microcontroller**

#### Quentin in GlobalFoundries 22FDX



F. Conti, P. D. Schiavone and L. Benini, "XNOR Neural Engine: A Hardware Accelerator IP for 21.6-fJ/op Binary Neural Network Inference," in *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 37, no. 11, pp. 2940-2951, Nov. 2018.



#### **XNE Energy Efficiency**



Accuracy Loss is high even with retraining (10%+) → mixed precision TWN & TCN are also a very appealing alternative (under design)




#### Not Only CNNs: Hyper-Dimensional Computing





#### More efficiency (3): HD-Based smart Wake-Up Module



#### Taped out in 22FDX



#### More Efficiency (4): Focal Plane Processing

Enable the extraction of low-level features in a parallel and efficient way by **integrating pixel-wise mixed-signal processing circuits** on the sensor die **to reduce the imager energy costs**.



Fernández-Berni, Jorge, et al. "Image Feature Extraction Acceleration." Image Feature Detectors and Descriptors. Springer International Publishing, 2016. 109-132.

#### **Ultra-Low Power Imaging (GrainCam)**



This process naturally reflects the operation of a binarized pixel-wise convolution and can be seen as embedding the first convolutional layer within the image sensor die

M. Gottardi et al., "A 100 µW 128×64 pixels contrast-based asynchronous binary vision sensor for sensor networks applications," IEEE JSSC, 2009.



#### **Combinational "Fully Spatial" BNN**





#### **Synthesis Results**

#### Synthesis of both models with hard-wired or reconfigurable weights

GF 22nm SOI with LVT cells (typical corner case 0.65V, 25°C)

Table II: Synthesis and power results for different configurations

| netw. | type  | area [mm²] | area [MGE]† | time/img [ns] | time/img [FO4]‡ | E/img [nJ] | leak. [µW] | E-eff. [TOp/J] |
|-------|-------|------------|-------------|---------------|-----------------|------------|------------|----------------|
| 16×16 | var.  | 1.17       | 5.87        | 12.82         | 560             | 2.40       | 945        | 470.8          |
| 16×16 | fixed | 0.46       | 2.32        | 12.40         | 541             | 1.68       | 331        | 672.6          |
| 32×32 | var.  | 5.80       | 29.14       | 17.27         | 754             | 11.14      | 4810       | 479.4          |
| 32×32 | fixed | 2.61       | 13.13       | 21.02         | 918             | 11.67      | 1830       | 457.6          |

† Two-input NAND-gate size equivalent: 1 GE = 0.199 µm²
‡ Fanout-4 delay: 1 FO4 = 22.89 ps

Hundreds of TOPS/W!

Massive area reduction when hard-wiring the weights:

- XNOR operations reduce to wires or inverters, which can also be shared among different receptive fields
- Popcounts also exploit sharing mechanisms
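As a sanity check on the synthesis table, the sketch below converts area and delay into the gate-equivalent and FO4 units given in the footnotes, and computes the area saved by hard-wiring the weights. The values are copied from the table; small deviations from the printed MGE column are expected because the mm² figures are rounded.

```python
GE_UM2 = 0.199   # um^2 per gate equivalent (table footnote)
FO4_PS = 22.89   # ps per fanout-4 delay (table footnote)

# (network, weight type, area in mm^2, time per image in ns)
rows = [
    ("16x16", "var.", 1.17, 12.82),
    ("16x16", "fixed", 0.46, 12.40),
    ("32x32", "var.", 5.80, 17.27),
    ("32x32", "fixed", 2.61, 21.02),
]


def area_mge(area_mm2):
    """mm^2 -> millions of gate equivalents (1 mm^2 = 1e6 um^2)."""
    return area_mm2 * 1e6 / GE_UM2 / 1e6


def delay_fo4(time_ns):
    """ns -> number of FO4 delays."""
    return time_ns * 1e3 / FO4_PS


for net, kind, area, t in rows:
    print(f"{net} {kind:5s} {area_mge(area):6.2f} MGE {delay_fo4(t):5.0f} FO4")

saving_16 = 1 - 0.46 / 1.17
saving_32 = 1 - 2.61 / 5.80
print(f"area saved by hard-wiring: {saving_16:.0%} (16x16), {saving_32:.0%} (32x32)")
```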

## Advanced Synthesis Tools become central to exploit weights and intermediate results sharing to reduce the area occupation



M Rusci, L Cavigelli, L Benini "Design automation for binarized neural networks: A quantum leap Opportunity?" 2018 IEEE International Symposium on Circuits and Systems (ISCAS), 1-5

#### What about Security? A Secure EE AI Processor



[3] F. Conti et al., An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics, IEEE TCAS-I 2017



#### Data security: HWCrypt – a Cryptographic Accelerator



[1] T. Unterluggauer et al., Leakage Bounds for Gaussian Side Channels, CARDIS 2017



HWCrypt is a «collection» of two crypto engines plugged into the shared memory and controlled via the peripheral interconnect

- AES Engine
  - AES-128-ECB: fast but not secure (plaintext patterns are ~visible in ciphertext); for comparison!
  - AES-128-XTS: each block encrypted with a different tweak; just as fast in the HWCrypt
  - individual execution of cipher rounds (to speed up new SW-based AES-based algorithms)
- Sponge Engine
  - two instances of Keccak-f[400]
  - leakage-resilient encryption scheme [1]
  - similar performance to AES engine

#### Fulmine SoC

Fulmine: Hardware Convolutional Engine (HWCE) in the Cluster





F. Conti et al., "An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics," in IEEE TCAS-I, Sept. 2017.

#### Secured ResNet-18: execution time & energy



- DPA-secure encryption of all communication: weights + intermediate CNN results
- Up to 70x speedup w.r.t. microcontroller, 15x w.r.t. pure SW
- 20x improvement in energy (diminishing return region: this is "good enough")
- Performance up to 3 s/frame; 50 mJ/frame



#### **Reconfigurable Heterogeneity: Arnold**

- RI5CY RISC-V 32b core
  - RV32IMFC + PULP extensions
  - 3.19 CoreMark/MHz
  - Memory protection
- Autonomous I/O subsystem
- 512 kB of memory
  - 4 interleaved banks of 112 kB (7 cuts of 4096x32 SRAM each)
  - 1 bank of 32 kB (2 cuts of 4096x32 SRAM)
  - 1 bank of 32 kB (2 cuts of 4096x32 SRAM)
- 8 kB ROM
- **3 FLLs**: core, peripherals and FPGA
- Embedded FPGA
  - 32x32 array
  - 32x32x4 4-input LUTs = 4096 4-input LUTs
  - 1024 registers
  - 6 clocks
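The memory and eFPGA figures are internally consistent, which a few lines of arithmetic confirm. The sketch assumes each SRAM cut is 4096 words of 32 bits and that each of the 32x32 array tiles holds four 4-input LUTs, which is how we read the slide.

```python
CUT_BYTES = 4096 * 32 // 8   # one 4096x32 SRAM cut holds 16 KiB

# (number of banks, cuts per bank) as listed on the slide
banks = [(4, 7), (1, 2), (1, 2)]

total_kib = sum(n * cuts * CUT_BYTES for n, cuts in banks) // 1024
luts = 32 * 32 * 4           # 32x32 array, 4 LUTs per tile (assumed reading)

print(total_kib, "kB of SRAM")   # 4*112 + 32 + 32
print(luts, "4-input LUTs")
```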



#### Arnold: MCU+ Embedded FPGA



#### **Arnold Physical View**



#### Many applications need 64-bit "numbers"

- For the first 4 years of the PULP project we used only 32-bit cores
  - Most IoT applications work well with 32-bit cores.
  - A typical 64-bit core is much more than 2x the size of a 32-bit core.
- But times change:
  - Large datasets, high-precision numerical calculations (e.g. double-precision FP) at the IoT edge and cloud
  - A lot of interest in the security community in working on a contemporary open-source 64-bit core.
  - High-performance computing (FP intensive) is again becoming a hot area for architecture and digital design research



#### ARIANE: >1GHz, Linux Capable 64-bit core





#### **Main properties of Ariane**

- Tuned for high frequency, 6 stage pipeline, integrated cache
  - In order issue, out-of-order write-back, in-order-commit
  - Supports privilege spec 1.11, M, S and U modes
  - Hardware Page Table Walker
- Implemented in GF 22FDX (Poseidon, Kosmodrom, Baikonur), and UMC65 (Scarabaeus)
  - In 22nm: ~1 GHz worst case conditions (SSG, 125/-40C, 0.72V), 1.7GHz typ @0.8V
  - 8-way 32kByte Data cache and 4-way 32kByte Instruction Cache
  - Core area: 175 kGE
- Application-class features are not cheap
  - 38% area in TLB, PTW
  - **51.8** pJ/op vs. 10 pJ/op in 22FDX @ 0.8V
  - IPC 0.85 vs. 0.94, 1.7 GHz vs. 690 MHz, just 2.1x faster!
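A rough way to read the last two bullets is throughput ≈ IPC × frequency. The sketch below applies that model to the quoted numbers. The slide does not name the comparison core, so it is labeled `small_core` here as an assumption. The simple model gives a speedup of about 2.2x, in the same range as the 2.1x on the slide, which presumably comes from measured benchmarks.

```python
# Figures quoted on the slide; "small_core" is our label for the unnamed comparison core
ariane = {"ipc": 0.85, "freq_mhz": 1700.0, "pj_per_op": 51.8}
small_core = {"ipc": 0.94, "freq_mhz": 690.0, "pj_per_op": 10.0}


def throughput_mops(core):
    """First-order throughput model: instructions per cycle times MHz."""
    return core["ipc"] * core["freq_mhz"]


speedup = throughput_mops(ariane) / throughput_mops(small_core)
energy_ratio = ariane["pj_per_op"] / small_core["pj_per_op"]

print(f"speedup      ~{speedup:.1f}x")
print(f"energy per op {energy_ratio:.1f}x higher")
```

Under this model the application-class core buys roughly twice the throughput for about five times the energy per operation.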





#### **Extreme FP Performance: The "V" Extension**






#### Implementation results in a 0.75mm x 1.25mm GF22 macro





- Post-synthesis PPA results
- WC operating frequency similar to Ariane
- Area: 3188 kGE
  - Each lane amounts to 533 kGE
  - Ariane (without caches) amounts to 474 kGE
- For a 256×256 integer MATMUL
  - Performance: 10.2 DP-GFLOPS
  - Power consumption: 192 mW
  - Energy efficiency: 53 DP-GFLOPS/W
- 3.1x GOPS/W wrt Ariane, at same frequency
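The energy-efficiency bullet follows directly from the performance and power figures. The sketch below reproduces it and derives the Ariane baseline implied by the 3.1x claim; that baseline is our inference and is not stated on the slide.

```python
perf_gflops = 10.2   # DP-GFLOPS, from the slide
power_w = 0.192      # 192 mW, from the slide

efficiency = perf_gflops / power_w          # DP-GFLOPS/W
implied_ariane = efficiency / 3.1           # inferred from "3.1x wrt Ariane"

print(f"vector unit: {efficiency:.1f} DP-GFLOPS/W")
print(f"implied Ariane baseline: {implied_ariane:.1f} GOPS/W (inferred)")
```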

#### Up to 98% utilization @ *n* × *n* DP-MATMUL (always?)





Matheus Cavalcante | 05.09.2019

#### Floating-Point $\rightarrow$ Transprecision FP

- Provide easy precision tuning
  - Formats: 64 (DFP), 32 (FP), 16 (HFP), 16ALT, 8
- Mainly consists of four operation groups
  - MUL/ADD: Add/Subtract, Multiply, FMA
  - CMP/SGNJ: Comparisons, Min/Max etc.
  - CAST: FP-FP casts, Int-FP / FP-Int casts

#### Parametrizable

- Number & Encoding of Formats (any Exp/Man bits)
- Packed-SIMD Vectors
- Number of pipeline stages (per Op and Format)
- Implementation (per Op and Format)
  - PARALLEL for best Speed
  - MERGED (or Iterative) for best Area
- Special Functions for Transprecision
  - Cast-and-Pack 2 FP Values to Vector
  - Casts amongst FP Vectors + Repacking
  - Expanding FMA (e.g. FP32 += FP16\*FP16)
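To illustrate why an expanding FMA matters, here is a small Python emulation using only the standard `struct` module to round values to binary16 and binary32. It is a numerical sketch, not a model of the FPU: the arithmetic is done in binary64 and then rounded, so rare double-rounding corner cases are ignored, and the function names are ours.

```python
import struct


def to_fp16(x):
    """Round a Python float (binary64) to the nearest binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fma_fp16(acc, a, b):
    """FP16 += FP16 * FP16: the accumulator stays in 16 bits."""
    return to_fp16(to_fp16(acc) + to_fp16(a) * to_fp16(b))


def expanding_fma(acc, a, b):
    """FP32 += FP16 * FP16: narrow operands, wide accumulator."""
    return to_fp32(to_fp32(acc) + to_fp16(a) * to_fp16(b))


# At 2048 the spacing of FP16 values is 2, so adding 1*1 is lost entirely
print(fma_fp16(2048.0, 1.0, 1.0))       # 2048.0
print(expanding_fma(2048.0, 1.0, 1.0))  # 2049.0
```

The operands still travel as 16-bit values, so memory traffic and SIMD packing benefit from the narrow format while the running sum keeps 32-bit precision.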





Stefan Mach | 05.09.2019

### **Result Highlights**

- While TP FPU adds 9% of Ariane core area vs RV64D, ...
- Super-Linear energy savings thanks to aggressive clock-gating
  - Mutually exclusive data paths rather than sharing



- ~pJ/FLOP @1GHz in 22FDX
  - 0.4 pJ FP8, 0.9 pJ FP16, 2.4 pJ FP32, 6.2 pJ FP64
- Transprecision applications will profit from this additional HW
- Fully integrated into RISC-V ISA through custom extension
  - Easy to leverage thanks to our GCC extensions, part of PULP SDK
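One way to read "super-linear" in the per-format energies is that halving the operand width cuts the energy per FLOP by more than half. The sketch below computes the ratios from the quoted figures; the interpretation is ours, not a statement from the slide.

```python
# pJ per FLOP at 1 GHz in 22FDX, keyed by format width in bits (from the slide)
pj_per_flop = {64: 6.2, 32: 2.4, 16: 0.9, 8: 0.4}

widths = sorted(pj_per_flop, reverse=True)
gains = []
for wide, narrow in zip(widths, widths[1:]):
    gain = pj_per_flop[wide] / pj_per_flop[narrow]
    gains.append(gain)
    print(f"FP{wide} -> FP{narrow}: {gain:.2f}x less energy for a 2x narrower format")

print(f"FP64 -> FP8 overall: {pj_per_flop[64] / pj_per_flop[8]:.1f}x for 8x fewer bits")
```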



#### Heterogeneous RISC-V platform from ULP to HPC



#### **OpenPiton: cache-coherent many-core system**

## OpenPiton

- Developed by Princeton
- Originally OpenSPARC T1
- Scalable NoC with coherent LLC
- Tiled Architecture

## Status

- Bare-metal Dec '18
- Update with support for SMP Linux just released
- Multiple different cores and ISAs (x86, SPARC, RISC-V)





ISA heterogeneity with a cache-coherent memory hierarchy



#### Hero: Fat (multi) core host, slim manycore accelerator



- First released in 2018
- Many-core PULP clusters connected with a general-purpose fat-core host with heterogeneous ISA – shared virtual memory (non coherent)





#### HERO v3: Heterogeneous 64-32b RISC-V

#### The best of both worlds?



Leading innovation with:

- lightweight shared virtual memory (SVM)
- distributed atomic transactions
- heterogeneous 64/32-bit LLVM toolchain
- support for predictable execution (PREM)




### HERO v3 First Silicon: Urania

- First HERO ASIC; first fully open-source, Linux-booting RISC-V SoC in the world
- 2 PULP clusters, each with
  - 4 RV32 RI5CY cores
  - 4 transprecision FPUs
  - 1 PULPO accelerator
  - 64 KiB TCDM in 8 banks
- Ariane RV64 host processor
- 128 KiB Shared LLC
- software-managed IOMMU
- DDR3 DRAM Controller + PHY





UMC 65nm LL

16 mm<sup>2</sup> die area, ca. 9 mm<sup>2</sup> logic core area

ca. 6 MGE logic core complexity, ca. 400 KiB SRAMs in total

# What's next?



## @pulp\_platform http://pulp-platform.org

#### Heterogeneous computing toward post-exascale

- Peak compute (GPU) 15TFLOP/s at 300W
  - 20x better needed for post-exascale: 1 TFLOP/s/W
- Only 5% power estimated to be spent in the FPUs [1]:
  - [1] reports 2.9%, but their kernels don't reach TDP/max perf.
  - In dubio pro NVIDIA: we scale power to assume modern GPUs do not exceed TDP at max perf. (making them more efficient)
  - Key issue: GPU RF is SRAM: FMUL32 4pJ, SRAM 20pJ
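The 20x target follows from the peak figures quoted here. A quick check, using only the numbers on the slide (variable names are ours):

```python
peak_flops = 15e12   # 15 TFLOP/s peak GPU compute
power_w = 300.0      # at 300 W

today = peak_flops / power_w   # FLOP/s per watt
target = 20 * today            # post-exascale requirement

print(f"today:  {today / 1e9:.0f} GFLOP/s/W = {1e12 / today:.0f} pJ/FLOP")
print(f"target: {target / 1e12:.0f} TFLOP/s/W = {1e12 / target:.0f} pJ/FLOP")
```

Expressed per operation, the whole chip has a budget of about 20 pJ per FLOP today, the same order as a single SRAM access at the quoted 20 pJ.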

[Figure: GPU power breakdown by component. Legend: Shared Memory, Const_SM, Const Cache, Texture Cache, FDS, ALU, INT, SFU, REG, FP, Idle power. Graph extracted and cropped from [1].]

| Volta SM                                                   | Volta Assembly                                      |
|------------------------------------------------------------|-----------------------------------------------------|
| 64 FPUs<br>256 kB RF<br>128 kB L0 Cache<br>32-2048 threads | LDS R2, [R0]<br>LDS R3, [R1]<br>FFMA R4, R2, R3, R2 |

2 mem. acc. ("[...]") + 8 reg. acc. into RF SRAM = 10 SRAM R/W total

[1] S. Hong and H. Kim, "An integrated gpu power and performance model," in ACM SIGARCH Computer Architecture News, 2010.




#### **Network Training Accelerator (NTX)**



Again: specialized "deep interfaces" + Instruction extensions



#### NTX Power Breakdown & GPU SM Comparison

- NTX dissipates significant fraction of power in its FPU (more is better):
  - 31% of cluster
  - 14% of the entire system if we account for main memory
  - Recall: GPU is just around 5% [1]



- Compared to NVIDIA Volta GPU [2]:
  - Register file in GPU holds registers and thread-local data
  - Each register read/write is an SRAM access
  - Register and data accesses compete for SRAM

| Volta SM        | 8 NTX cl    |
|-----------------|-------------|
| 64 FPUs         | 64 FPUs     |
| 256 kB RF       | 512 kB TCDM |
| 128 kB L0 Cache |             |
| 32-2048 threads | 8 threads   |

| Volta Assembly                                      | NTX Pseudocode                                                |
|-----------------------------------------------------|---------------------------------------------------------------|
| LDS R2, [R0]<br>LDS R3, [R1]<br>FFMA R4, R2, R3, R2 | FMAC accu, [AGU0], [AGU1]                                     |
| 2 mem. acc. ("[…]")<br>8 reg. acc.                  | 2 mem. acc. ("[…]")<br>0 reg. acc.<br>(+ addr. calc for free) |
| = 10 SRAM hits total                                | = 2 SRAM hits total                                           |
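A toy energy model makes the SRAM-hit counts tangible. It combines them with the per-operation energies quoted earlier in the deck (FMUL32 4 pJ, SRAM access 20 pJ). It deliberately ignores everything else (instruction fetch, control, interconnect, leakage) and treats the FMA as costing the quoted FMUL32 energy, so the percentages are illustrative only.

```python
FPU_PJ = 4.0    # FMUL32 energy quoted in the deck
SRAM_PJ = 20.0  # SRAM access energy quoted in the deck


def fpu_fraction(sram_accesses):
    """Share of the (FPU + SRAM) energy that is spent in the FPU itself."""
    total = FPU_PJ + sram_accesses * SRAM_PJ
    return FPU_PJ / total


volta = fpu_fraction(2 + 8)  # 2 memory + 8 register-file accesses
ntx = fpu_fraction(2 + 0)    # 2 memory accesses, accumulator kept in the FPU

print(f"Volta-style sequence: {volta:.1%} of energy in the FPU")
print(f"NTX-style FMAC:       {ntx:.1%} of energy in the FPU")
```

Even this crude model lands in the same range as the roughly 5% FPU share cited for GPUs, and it shows why removing register-file traffic raises the useful fraction several times over.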



### **NTX Roofline and efficiency**


- NTX achieves high utilization of available bandwidth and compute
- We investigate a range of different kernels:
- Linear Algebra
  - Mat-Mat product (GEMM)
  - Mat-Vec product (GEMV)
  - Vector sum (AXPY)
- Stencils
  - Discrete Laplace Operator in 1D/2D/3D
  - Diffusion
- **Deep Learning**
- 2 to 3x more efficient than GPGPU





#### **Technology to the rescue**

Reduce pJ/B to access main mem

What about the ~30% of power that goes in the memory interface?



Intel's upcoming 3D-stacked processor, codename Lakefield

Reduce the number of accesses...





#### **Industrial open SW Hardware**

lowRISC Community Interest Company



enabling open source silicon through collaborative engineering





#### LowRISC is up and... hiring



Alex Bradbury, Dr Gavin Ferris, Dr Robert Mullins, Prof. Luca Benini, Ron Minnich, Dominic Rizzo






#### Will one NFP Company be Enough?




#### **OpenHW Group Charter**

**OpenHW Group** is a not-for-profit, global organization driven by its members and individual contributors where hardware and software designers collaborate in the development of open-source cores, related IP, tools and software such as the **CORE-V Family of cores**. OpenHW provides an infrastructure for hosting high quality open-source HW developments in line with industry best practices.



R. O'Connor (OpenHW CEO, former RISC-V foundation director)








## The fun is just beginning...



# **Questions?**



## @pulp\_platform http://pulp-platform.org