#### A 90nm Power Optimization Methodology and its Application to the ARM1136JF-S Microprocessor

A. Khan<sup>+</sup>, P. Watson<sup>\*</sup>, G. Kuo<sup>+</sup>, D. Le<sup>+</sup>, T. Nguyen<sup>+</sup>, S. Yang<sup>+</sup>, P. Bennet<sup>+</sup>, P. Huang<sup>+</sup>, J. Gill<sup>+</sup>, D. Wang<sup>+</sup>, I. Ahmed<sup>+</sup>, P. Tran<sup>+</sup>, H. Mak<sup>+</sup>, O. Kim<sup>+</sup>, F. Martin<sup>+</sup>, Y. Fan<sup>+</sup>, D. Ge<sup>+</sup>, J. Kung<sup>+</sup>, V. Shek<sup>+</sup>

\* ARM Ltd.

+ Cadence Design Systems, Inc.

#### 2005 IEEE Custom Integrated Circuits Conference, San Jose, CA, September 21, 2005

Sponsored by the Institute of Electrical and Electronics Engineers Solid-State Circuits Society (IEEE SSCS) and co-sponsored by the IEEE Electron Devices Society

IEEE CICC 2005; paper 27-2

© IEEE; September 21, 2005

# Topics

- Introduction
- System and IC architecture
- Electrical/physical design methodology
  - RTL synthesis for multiple supply voltage operation and leakage/speed optimization
  - $V_{\text{DD}}$ selection and power distribution
  - Clock gating
- System-level design validation
- Summarized parametric data
- Summary
- Acknowledgments and references

### Introduction

- SoC designs developed for consumer electronics products: energy-efficient design is critical
- Develop design methodology and techniques for power reduction
  - Focus on methodologies leverageable to synthesis-driven digital designs
    - IC: 355 MHz (typ.) in 90 nm CMOS
      - ARM1136JF-S microprocessor → higher frequency operation
    - Dual $V_{DD}$ domains, dual-$V_T$ cell libraries
- Collaborative effort (ARM, Cadence and TSMC)
  - SoC designed and manufactured
  - Measured power dissipation results (instrumented system)
    - Gaming and other applications utilized for evaluation
- Results to be presented

## **System architecture**

- ARM RealView<sup>®</sup> Validation System
  - Instrumented system
- Run applications, measure performance
  - Games, other; power dissipation



# IC architecture



- ARM1136JF-S microprocessor
  - 16k data + instruction cache
  - 16 kB TCM
  - Additional Tag RAMs and TLBs
  - 44 memory instances
  - ARM and Thumb instruction sets
  - Extended DSP instructions
  - Jazelle enabled technology
    - Direct execution of Java bytecodes
- ETM11 trace macro, ETB11 trace buffer
- Advanced High-performance Bus (AHB) fabric
  - Connect core AHB Lite ports to full AHB interface (pin accessible)
  - Access to 128 KB on-chip test RAM: Enable concurrent data transfers from any four ports

# IC, design overview



- 300 K standard cell instances
- 44 SRAM instances
- 355 MHz typical operation in 90nm standard CMOS

- Electrical/physical design
  - ARM Artisan Physical IP
  - RTL synthesis for multiple supply voltage operation and leakage/speed optimization
  - V<sub>DD</sub> selection and power distribution
  - Clock gating
  - Timing (electrical) closure in a multi-V<sub>DD</sub> domain design (including ECOs)

# **Design methodology overview**

Timing, power and area optimization flow:

- RTL clock gating
- Dynamic and leakage power optimization
- Multi-supply voltage (MSV) floorplanning
- MSV-aware power planning and power mesh optimization
- Placement and level shifter insertion
- Multi-Vt, multi-supply voltage aware physical optimization
- Low-power clock tree synthesis
- MSV-aware power routing
- Timing, crosstalk, MSV-driven signal routing
- Post-route timing, crosstalk and multi-Vt optimization
- Leakage and dynamic IR drop and EM verification

Supporting analysis: accurate MSV delay calculation and crosstalk analysis

- Methodology leverageable to synthesized digital designs
- Newly-developed and current-generation EDA tools

- Single-pass concurrent optimization (timing, power, area)
- Instance-specific numerical modeling of delay at multiple  $V_{DD}$  levels
  - <2% variance with respect to full circuit simulation




### **Microprocessor verification**

- Set code and memory configurations for microprocessor
- Verify RAM functionality in 90nm CMOS process
- Verify microprocessor functionality (RTL)
  - 700 test cases (>135K vectors)
  - Multi-day run time
  - Generated vector sets subsequently used for power dissipation analysis
    - VCD and TCF formats
- Fully verified RTL used as "golden reference"
  - Regression tests / functional verification prior to tape-out

# **MSV RTL synthesis**



- 62% of standard cells in 0.8 V domain; 38% in 1.0 V domain
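
As a first-order illustration of why placing cells in the lower supply domain reduces dynamic power, the textbook CMOS switching-power relation can be applied to the two supply values above. The symbols $\alpha$ (activity factor), $C$ (switched capacitance) and $f$ (clock frequency) belong to that generic model and are not quantities reported in these slides; the ratio below also assumes $\alpha$, $C$ and $f$ are unchanged and ignores level-shifter overhead.

$$P_{\text{dyn}} = \alpha\, C\, V_{\text{DD}}^{2}\, f \quad\Rightarrow\quad \frac{P_{\text{dyn}}(0.8\ \text{V})}{P_{\text{dyn}}(1.0\ \text{V})} = \left(\frac{0.8}{1.0}\right)^{2} = 0.64$$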

# **Power optimization in synthesis**









- Logic restructuring
- Logic resizing
  - Before clock tree synthesis
- Buffer removal/resizing
- Pin swapping
  - Connect high-transition-rate signal nets to low-capacitance inputs

- Transition rate buffering
  - Buffer slow-transition nets to minimize the interval in which both pFET and nFET conduct current simultaneously
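
The pin-swapping idea above can be sketched in a few lines of Python. This is a minimal illustration, not the synthesis tool's algorithm: it assumes the input pins of a gate are logically interchangeable, and the net names, activity values and pin capacitances are hypothetical.

```python
from typing import Dict


def swap_pins(net_activity: Dict[str, float],
              pin_cap_ff: Dict[str, float]) -> Dict[str, str]:
    """Map nets to interchangeable input pins so that the busiest net
    drives the lowest-capacitance pin (hypothetical data model)."""
    if len(net_activity) != len(pin_cap_ff):
        raise ValueError("need exactly one net per pin")
    nets = sorted(net_activity, key=net_activity.get, reverse=True)
    pins = sorted(pin_cap_ff, key=pin_cap_ff.get)
    return dict(zip(nets, pins))


def switched_cap(assignment: Dict[str, str],
                 net_activity: Dict[str, float],
                 pin_cap_ff: Dict[str, float]) -> float:
    """Activity-weighted input capacitance, a proxy for dynamic power."""
    return sum(net_activity[n] * pin_cap_ff[p] for n, p in assignment.items())


# Hypothetical 3-input gate: toggle rates per net, capacitance per pin (fF).
activity = {"n_fast": 0.40, "n_mid": 0.10, "n_slow": 0.02}
caps = {"A": 2.0, "B": 1.5, "C": 1.0}

original = {"n_fast": "A", "n_mid": "B", "n_slow": "C"}
swapped = swap_pins(activity, caps)
```

With these made-up numbers the busiest net moves from pin `A` to pin `C`, and the activity-weighted capacitance drops; in a real flow the swap would also have to respect per-pin timing arcs.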

### V<sub>DD</sub> domains and VLS cells





- 0.8 V and 1.0 V $V_{DD}$ domains
  - Analyze standard-cell delay, leakage, standby and dynamic power (2.5x delta)
  - Adequate performance for timing-critical nets per domain
  - Customization → further improvements feasible
- Automated voltage level-shifting cells (VLS) insertion
  - For nets traversing  $V_{DD}$  domains
  - Align cells to avoid n-well spacing violations (domain perimeter placements)
  - Automated multi-V<sub>DD</sub> power distribution and cell placements, antenna diode insertion
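
A minimal sketch of how nets traversing $V_{DD}$ domains might be identified as candidates for VLS insertion follows. The `Net` structure, cell names and domain map are hypothetical, and the sketch flags every crossing in either direction; the slides do not state which crossing directions received a level shifter.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Net:
    name: str
    driver: str              # driving cell instance
    sinks: Tuple[str, ...]   # receiving cell instances


def find_vls_crossings(nets: List[Net],
                       cell_domain: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Return (net, driver V_DD, sink V_DD) for each domain crossing.

    One entry is produced per net and foreign sink domain, so sinks in
    the same foreign domain could share a single level shifter.
    """
    crossings = []
    for net in nets:
        src = cell_domain[net.driver]
        foreign = sorted({cell_domain[s] for s in net.sinks} - {src})
        for dst in foreign:
            crossings.append((net.name, src, dst))
    return crossings


# Hypothetical netlist fragment with cells assigned to 0.8 V or 1.0 V.
domains = {"u1": 0.8, "u2": 1.0, "u3": 0.8, "u4": 1.0}
netlist = [
    Net("n1", "u1", ("u2", "u3")),  # 0.8 V driver, one 1.0 V sink
    Net("n2", "u2", ("u4",)),       # stays inside the 1.0 V domain
    Net("n3", "u4", ("u1", "u3")),  # 1.0 V driver, two 0.8 V sinks
]
crossings = find_vls_crossings(netlist, domains)
```

Placement legality (n-well spacing at the domain perimeter) is outside the scope of this sketch.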

# **Clock gating**

- Architectural clock gating included in microprocessor RTL
- Automated design flow → add further clock gating inferred from RTL through low-power synthesis
  - 1,000 clock gated cells identified and managed $\rightarrow$ 85% of registers gated
  - Shut off dynamic current in quiescent logic (application requirements)
- Clock de-cloning: reduce number of cells from 1,112 to 703
  - Move clock gating to the highest hierarchical node of the logic tree
    - Reduce power dissipation, insertion delay
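
Clock de-cloning can be illustrated with a small Python sketch under simplifying assumptions: gating cells that share the same clock and enable signal are treated as clones and merged into one cell at their common hierarchical ancestor. The `GatingCell` fields and the example hierarchy are hypothetical, and fanout limits and timing are not modeled.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GatingCell:
    clock: str
    enable: str
    path: Tuple[str, ...]   # hierarchy path of the cell
    registers: int          # number of registers it gates


def common_ancestor(paths: List[Tuple[str, ...]]) -> Tuple[str, ...]:
    """Longest shared prefix of the hierarchy paths."""
    prefix = []
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        prefix.append(level[0])
    return tuple(prefix)


def declone(cells: List[GatingCell]) -> List[GatingCell]:
    """Merge gating cells with identical (clock, enable) into one cell
    placed at the common ancestor of the originals."""
    groups = defaultdict(list)
    for cell in cells:
        groups[(cell.clock, cell.enable)].append(cell)
    merged = []
    for (clock, enable), members in groups.items():
        merged.append(GatingCell(
            clock, enable,
            common_ancestor([m.path for m in members]),
            sum(m.registers for m in members),
        ))
    return merged


# Hypothetical example: two clones of en_a inside the ALU, one en_b cell.
cells = [
    GatingCell("clk", "en_a", ("core", "alu", "add"), 8),
    GatingCell("clk", "en_a", ("core", "alu", "mul"), 16),
    GatingCell("clk", "en_b", ("core", "lsu"), 4),
]
merged = declone(cells)
```

For scale, the reduction reported above, from 1,112 to 703 cells, is roughly 37% fewer clock gating cells.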





### Timing closure in multi-V<sub>DD</sub> design (1)



- VLS placement directly affects electrical performance
  - Optimal or detoured routing
  - Power-supply-aware timing and multi-V<sub>DD</sub> supply constraints → drive placement
  - Support ECOs
  - Netlist modified to insert VLS cells where needed
- Present approach (automated)
  - Complete timing-driven P&R without VLS; insert VLS; optimize

### Timing closure in multi-V<sub>DD</sub> design (2)

- Cell substitution with timing constraint
  - Replace standard-V<sub>T</sub> with high-V<sub>T</sub> cells
  - Net-by-net basis; same footprint as original cell
- Signal integrity addressed within P&R
  - ~10 of 500K nets required post-layout optimization
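
Cell substitution under a timing constraint can be sketched as a greedy pass: swap a standard-V<sub>T</sub> cell for its high-V<sub>T</sub> equivalent only when the remaining slack stays non-negative. This is a simplified illustration, not the flow's actual optimizer. All field names and numbers are hypothetical, and slack is treated per cell, whereas a real flow would re-run timing after each swap because cells on the same path share slack.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    name: str
    slack_ps: float           # worst slack through the cell (standard-VT)
    delay_penalty_ps: float   # extra delay if swapped to high-VT
    leakage_saving_nw: float  # leakage saved by the swap
    vt: str = "svt"


def substitute_high_vt(cells: List[Cell], margin_ps: float = 0.0) -> List[str]:
    """Greedily swap cells to high-VT, largest leakage saving first,
    keeping slack after the swap at or above margin_ps."""
    swapped = []
    for cell in sorted(cells, key=lambda c: c.leakage_saving_nw, reverse=True):
        if cell.slack_ps - cell.delay_penalty_ps >= margin_ps:
            cell.vt = "hvt"   # same footprint assumed, so no re-placement
            swapped.append(cell.name)
    return swapped


# Hypothetical cells: u2 saves the most leakage but lacks slack.
cells = [
    Cell("u1", slack_ps=50.0, delay_penalty_ps=20.0, leakage_saving_nw=5.0),
    Cell("u2", slack_ps=10.0, delay_penalty_ps=20.0, leakage_saving_nw=9.0),
    Cell("u3", slack_ps=30.0, delay_penalty_ps=30.0, leakage_saving_nw=2.0),
]
swapped = substitute_high_vt(cells)
```

Timing-critical cells such as `u2` stay on standard-V<sub>T</sub>, which is the intent of constraining the substitution by timing.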



- Effective current source model (ECSM) → instance-specific multiple V<sub>DD</sub> delay calculation
  - Standard cell libraries characterized for multiple $V_{DD}$ values at outset
- Numerical model: <2% deviation vs. full circuit simulation



### **System-level validation**




Measured results table (column headings not recoverable from the source); "Total" row: 38%, 47%, 40%, 0.60

### Summary

- Methodology obtained ~40% overall and 46% leakage power reduction
  - Managed leakage power with dual-$V_{\text{T}}$ cells and dual $V_{\text{DD}}$ domains
  - Managed dynamic power with dual $V_{\text{DD}}$ domains, dual-$V_{\text{T}}$ cells, voltage scaling and automated clock gating
  - Power integrity verified throughout the design process
- Developed design methodology and techniques for power reduction
  - Single-pass synthesis with concurrent optimization (timing, power, area)
  - Newly-developed and current-generation EDA tools
  - Methodology leverageable to synthesis-driven digital designs

### **Summarized IC data**

| Parameter        | Data                      |
|------------------|---------------------------|
| Clock Frequency  | 355 MHz (Typ. Conditions) |
| Technology       | TSMC 90G                  |
| Transistor Count | 22M                       |
| Core Voltage     | 1.0 V, 0.8 V              |
| I/O Voltage      | 3.3 V                     |
| Pin Count        | 362                       |

### **Acknowledgments and references**

- Acknowledgments
  - We thank C. Chu, A. Gupta, J. Goodenough, A. Harry, C. Hopkins, L. Jensen, T. Valind, L. Milano, A. Iyer, P. Mamtora, J. Willis, M. McAweeney, R. Williams and the ARM Physical IP team for their contributions
- References
  - Gartner, WW ASIC/ASSP, FPGA/PLD and SLI/SOC App. Fcst., 1Q04
  - B. Calhoun, "Ultra-Dynamic Voltage Scaling Using Sub-threshold Operation and Local Voltage Dithering in 90nm CMOS," ISSCC, Feb. 2005
  - S. Henzler, "Sleep Transistor Circuits for Fine-Grained Power Switch-Off with Short Power-Down Times," ISSCC, Feb. 2005
  - http://www.arm.com/pdfs/DUI0273B\_core\_tile\_user\_guide.pdf
  - A. Khan et al., "Design and Development of 130-nanometer ICs for a Multi-Gigabit Switching Network System," CICC, Oct. 2004
  - D. Desharnais, "Nanometer IC routing requires new approaches," EEDesign.com, Dec. 2003
  - A. Khan et al., "A 150 MHz Graphics Rendering Processor with 256Mb Embedded DRAM," ISSCC, Feb. 2001
  - G. Paul et al., "A Scalable 160Gb/s Switch Fabric Processor with 320Gb/s Memory Bandwidth," ISSCC, Feb. 2004