## Sam Naffziger AMD Senior Fellow





High Performance Processors in a Power Limited World

## Outline

Smarter Choice

Today's processor design landscape

• Trends

Issues making designers' lives difficult

- Power limits
- Scaling effects

**Design opportunities** 

- Circuit level
- Architectural

Summary

### The All Consuming Quest for Greater Performance at Lower Cost





Moore's Law has served us well.

### **Processor Frequency vs. Time**





**MPU Performance vs Time** 

The amazing frequency increases of the past decade have leveled off. Why?

- Power limits
- Process issues

## Outline


Today's processor design landscape

• Trends

### Issues making designers' lives difficult

- Power limits
- Scaling effects

**Design opportunities** 

- Circuit level
- Architectural

Summary

### **Power Consumption Background**



**G. Moore**, *Cramming more components onto integrated circuits*, Electronics, Volume 38, Number 8, April 19, **1965** 

Power has always challenged circuit integration

We've been bailed out by technology in the past








### **Scaling Background**







Process scaling

### **Power Consumption Background**

- Reducing Vdd
- Reducing C<sub>TOT</sub>
- Reducing I<sub>LEAK</sub>, I<sub>CO</sub>
- Reducing α
- But now, not only are those improvements fading, but we have a host of new challenges
  - Variation
  - Voltage droop
  - Wire non-scaling

$$P \approx \underbrace{C_{TOT} \cdot \alpha \cdot F \cdot V_{dd}^{2}}_{\text{Switching power}} + \underbrace{N_{TOT} \cdot \alpha \cdot F \cdot V_{dd} \cdot I_{CO}}_{\text{Crossover power}} + \underbrace{N_{ON} \cdot I_{LEAK} \cdot V_{dd}}_{\text{Leakage power}} + \cdots$$

Slide annotations on the equation: "The processor designer" and "The process guys have had the biggest impact on these"
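As a rough illustration of the power approximation on this slide, the sketch below evaluates its switching, crossover and leakage terms at two supply voltages. The function name and every numeric value are hypothetical and normalized, chosen only to show how each term responds to Vdd; nothing here is measured data from the talk.

```python
def processor_power(c_tot, alpha, freq, vdd, n_tot, i_co, n_on, i_leak):
    """Evaluate the three power terms of the slide's approximation.

    All inputs are normalized, hypothetical values; I_LEAK and I_CO are held
    fixed here even though in real silicon they also move with Vdd.
    """
    switching = c_tot * alpha * freq * vdd ** 2
    crossover = n_tot * alpha * freq * vdd * i_co
    leakage = n_on * i_leak * vdd
    return switching, crossover, leakage


# Hypothetical normalized operating point (not taken from the slides).
nominal = dict(c_tot=1.0, alpha=0.1, freq=1.0, n_tot=1.0, i_co=0.02,
               n_on=1.0, i_leak=0.03)

base = processor_power(vdd=1.0, **nominal)
reduced = processor_power(vdd=0.9, **nominal)

for name, b, r in zip(("switching", "crossover", "leakage"), base, reduced):
    print(f"{name}: {r / b:.2f}x at 0.9 Vdd")
```

With currents held fixed, the switching term falls with the square of Vdd while the other two fall only linearly, which is one way to see why Vdd reduction was such an effective lever.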

## Outline

Today's processor design landscape

Trends

### Issues making designers' lives difficult

- Power limits
- Scaling effects

**Design opportunities** 

- Circuit level
- Architectural

Summary





### The Silicon Age Still on a Roll, But ...



| High Volume Manufacturing | 2004 | 2006 | 2008 | 2010 | 2012 | 2014 | 2016 | 2018 |
|---------------------------|------|------|------|------|------|------|------|------|
| Technology Node (nm) | 90 | 65 | 45 | 32 | 22 | 16 | 11 | 8 |
| Delay = CV/I scaling | 0.7 | ~0.7 | >0.7 | Delay scaling will slow down | | | | |
| Energy/Logic Op scaling | >0.35 | >0.5 | >0.5 | Energy scaling will slow down | | | | |
| Bulk Planar CMOS | High Probability | | | | | Low Probability | | |
| Alternate, 3G etc | Low Probability | | | | | High Probability | | |
| Variability | | Medium | | High | | Very High | | |
| RC Delay | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

Scaling has some nasty side effects

Roadmap figure. Labels: electrostatic control (bulk planar, PDSOI, FDSOI, MuGFET); stressors; + substrate engineering; + high µ materials; gate stack. Milestones: 2007 (65 nm), 2010 (45 nm), 2013 (32 nm), 2016 (22 nm). Source: European Nanoelectronics Initiative Advisory Council (ENIAC)



### **Device Variation Reverse Scales**

### Accuracy in 0.25 µm CMOS





## The Problem: Atoms don't scale

Granularity on molecular level is reached: 0.25/0.25 transistor = 1200 doping atoms

$$\begin{array}{l} V_{T} \propto 1200 \\ \sigma_{\Delta VT} \propto \sqrt{1200} \approx 3\% V_{T} \end{array}$$

Source: Pelgrom, IEEE lecture 5/11/06

Variations subtract directly off cycle time

- → Power efficiency drops
- → Circuit margins degrade





Intel

Granularity on molecular level is reached: 0.1/0.065 transistor = 60-80 doping atoms in depletion region

$$\begin{array}{l} V_{T} \propto 80 \\ \sigma_{\Delta VT} \propto \sqrt{80} \approx 11\% V_{T} \end{array}$$
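The percentages quoted for the two transistor sizes follow if V<sub>T</sub> is taken as proportional to the dopant count N and its spread as proportional to √N, so that σ/V<sub>T</sub> = 1/√N. The short check below assumes exactly that simplified model; the function name is illustrative.

```python
import math


def relative_vt_sigma(dopant_atoms):
    """Relative V_T spread (sigma / V_T), assuming V_T ~ N and sigma ~ sqrt(N)."""
    return math.sqrt(dopant_atoms) / dopant_atoms


for n in (1200, 80):
    print(f"{n} dopant atoms -> sigma/V_T ≈ {relative_vt_sigma(n):.1%}")
```

Under this model, going from 1200 to 80 dopant atoms raises the relative spread from about 3% to about 11%, matching the slide.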

### **One impact of variation is leakage spreads**





## Scaling Intrinsically Hurts Supply Integrity



With power per core staying constant but area, voltage and cycle times dropping, we have a big challenge

Requiring a higher voltage to hit frequency has a quadratic power impact
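A back-of-the-envelope sketch of why supply integrity gets harder: at constant power, lower voltage means more current, delivered through less area and switched in less time. The scale factors below are hypothetical, and the di/dt figure is only a proxy that assumes the full supply current can swing within one cycle.

```python
def supply_stress(power, vdd, area, cycle_time):
    """Return supply current, current density and a di/dt proxy.

    The di/dt proxy assumes the full supply current can swing within one
    cycle, which is a simplification made for this sketch.
    """
    current = power / vdd
    return {
        "current": current,
        "current_density": current / area,
        "di_dt": current / cycle_time,
    }


# Normalized baseline and a hypothetical next node (all scale factors assumed).
base = supply_stress(power=1.0, vdd=1.0, area=1.0, cycle_time=1.0)
scaled = supply_stress(power=1.0, vdd=0.9, area=0.5, cycle_time=0.8)
for key in base:
    print(f"{key}: {scaled[key] / base[key]:.2f}x")

# Raising Vdd by 5% to recover frequency costs about 10% switching power.
voltage_bump = 1.05
print(f"switching power: {voltage_bump ** 2:.2f}x")
```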

## Outline

Today's processor design landscape

• Trends

Issues making designers' lives difficult

- Power limits
- Scaling effects

### **Design opportunities**

- Circuit level
- Architectural

### Summary





### Some Ways to Shoulder the Variation Burden: Adaptive clocking





Empirically set the clock edge to optimize frequency

Higher granularity  $\rightarrow$  more variation tolerance

LBIST and GA search algorithms show promise for per-part optimization
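To make the idea of empirically setting clock edges concrete, here is a minimal sketch of a per-part search over clock delay taps. It uses a plain coordinate search, not the GA mentioned on the slide, and `measure_fmax` is a hypothetical stand-in for an on-die BIST run; the toy model and all names are assumptions for illustration only.

```python
def tune_clock_delays(measure_fmax, n_domains, n_taps, sweeps=2):
    """Coordinate search over per-domain clock delay taps.

    measure_fmax(settings) is a hypothetical stand-in for a BIST run that
    returns the highest passing frequency for a tuple of tap settings.
    """
    settings = [0] * n_domains
    best = measure_fmax(tuple(settings))
    for _ in range(sweeps):
        for domain in range(n_domains):
            for tap in range(n_taps):
                trial = list(settings)
                trial[domain] = tap
                fmax = measure_fmax(tuple(trial))
                if fmax > best:  # keep a tap only if it raises Fmax
                    best, settings = fmax, trial
    return tuple(settings), best


# Toy part model: each domain has an ideal tap; every tap of error costs 0.1 GHz.
IDEAL_TAPS = (2, 0, 3)


def toy_bist(settings):
    return 3.0 - 0.1 * sum(abs(s - i) for s, i in zip(settings, IDEAL_TAPS))


settings, fmax = tune_clock_delays(toy_bist, n_domains=3, n_taps=4)
print(settings, fmax)
```

More domains (higher granularity) give the search more knobs to absorb local variation, at the cost of more BIST runs per part.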

### Some Ways to Shoulder the Variation Burden: Self Healing Designs

- Simplest example is cache ECC on memory arrays
- Next level is Intel's Pellston technology implemented on Montecito and Tulsa
  - Disable defective lines detected by multiple ECC errors
- Future directions involve self-checking with redundant logic and retry
  - Predict result through parity, residues or redundant logic
  - On an error, replay calculation before committing architectural state
  - If replay correct, it was a transient error (particle strike, Vdd droop, random noise coupling etc.)
  - If incorrect, can reduce frequency, increase voltage or retry with an alternate execution path
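The self-checking flow above can be sketched in software as a residue-checked add with replay. This is only an illustration of the idea under stated assumptions: the mod-3 residue, the function names and the fault-injecting adder are all hypothetical, and real hardware would do this in logic before committing architectural state.

```python
def residue_checked_add(a, b, adder, max_replays=1, modulus=3):
    """Add with a residue check; replay on mismatch before committing.

    Returns (result, replays_used). Raises if the error persists, which is
    where a real design might lower frequency, raise voltage or switch paths.
    """
    expected = (a % modulus + b % modulus) % modulus  # predicted residue
    for attempt in range(1 + max_replays):
        result = adder(a, b)
        if result % modulus == expected:
            return result, attempt
    raise RuntimeError("persistent error: escalate beyond replay")


def make_flaky_adder(faults):
    """Hypothetical adder whose first `faults` calls flip bit 0 of the sum."""
    state = {"left": faults}

    def adder(a, b):
        total = a + b
        if state["left"] > 0:
            state["left"] -= 1
            return total ^ 1  # inject a single-bit transient error
        return total

    return adder


result, replays = residue_checked_add(20, 22, make_flaky_adder(1))
print(result, replays)
```

A single replay that succeeds indicates a transient error; a fault that survives the replay is raised to the caller for a stronger response.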

### Some Ways to Shoulder the Variation Burden: Self Healing Designs





All Rights Reserved, Copyright© FUJITSU LIMITED 2006

### **Adaptive Supply Voltage**





Per-part and dynamic voltage management are key

More range flexibility and finer grain response will provide differentiation



### Integrated Power and Thermal Management



"Fuse and forget" is no longer viable

Too much variation in environment, manufacturing and operating conditions

Some means of dynamic optimization needed





### Integrated Power and Thermal Management

An autonomous programmable controller enables real time optimizations

An embedded controller provides the needed flexibility

- OS interfacing
- Multi-core management
- Per-part optimization







## Outline

Today's processor design landscape

• Trends

Issues making designers' lives difficult

- Power limits
- Scaling effects

### **Design opportunities**

- Circuit level
- Architectural

Summary





### **Traversing the Power Contour**



#### **Power Consumption**



### **Traversing the Power Contour**





## Traversing the Power Contour for a Given Implementation



**Energy / Operation** 



### For Comparing Architectural Efficiency, Performance<sup>3</sup>/W is most effective



Performance<sup>3</sup> / Watt
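One way to see why Performance<sup>3</sup>/W suits architectural comparisons: if frequency tracks voltage and dynamic power dominates, power goes as V²F, i.e. as the cube of the V/F scale factor, while performance goes linearly with it. The sketch below assumes exactly that idealized model (no leakage, no V<sub>MIN</sub>); the function names are illustrative.

```python
def scaled_design(perf, power, vf_scale):
    """Scale voltage and frequency together by vf_scale.

    Assumes performance ~ F and power ~ V^2 * F, so power ~ vf_scale^3.
    """
    return perf * vf_scale, power * vf_scale ** 3


def perf_per_watt(perf, power):
    return perf / power


def perf3_per_watt(perf, power):
    return perf ** 3 / power


for scale in (1.0, 0.9, 0.8):
    perf, power = scaled_design(1.0, 1.0, scale)
    print(f"V/F x{scale}: perf/W = {perf_per_watt(perf, power):.2f}, "
          f"perf^3/W = {perf3_per_watt(perf, power):.2f}")
```

Under these assumptions Perf/W changes with the chosen operating point, while Perf³/W stays constant, so it reflects the design and not where it sits on the power contour.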



### **Optimal Pipeline Depth**



V. Srinivasan et al., MICRO-35







### A Look at Mobile System Power





If a laptop burned TDP power all the time, battery life would be measured in minutes

How do we get mobile average power so much lower than TDP?

### The Answer: Take Advantage of Typically Low CPU Utilization



### Reducing Power and Cooling Requirements with AMD Processor Performance States



clocks completely and dropping voltage to retention levels
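To illustrate how low utilization pulls average power far below TDP, the sketch below weights the power of each operating state by the fraction of time spent in it. The residency fractions and wattages are hypothetical, not AMD data.

```python
def average_power(residency):
    """Time-weighted average power.

    residency: iterable of (fraction_of_time, watts); fractions must sum to 1.
    """
    residency = list(residency)
    total = sum(fraction for fraction, _ in residency)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("residency fractions must sum to 1")
    return sum(fraction * watts for fraction, watts in residency)


# Hypothetical mobile profile: (fraction of time, watts in that state).
profile = [
    (0.05, 35.0),  # highest performance state, near TDP
    (0.15, 12.0),  # reduced voltage/frequency state
    (0.80, 1.0),   # idle: clocks stopped, voltage at retention level
]
print(f"average power: {average_power(profile):.2f} W")
```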







### Adding Features to Increase Performance





Increasing execution efficiency has historically hurt power efficiency

However, the cubic reduction of power with V/F scaling has tended to make this a good tradeoff
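A small worked example of that tradeoff, assuming power scales with the cube of the V/F scale factor and performance is IPC times frequency. The feature numbers (+10% IPC for +20% power) are hypothetical.

```python
def iso_power_performance(ipc_gain, power_increase):
    """Relative performance after scaling V/F back to the original power.

    Assumes frequency tracks voltage, so power scales with the cube of the
    V/F scale factor, and that performance is IPC times frequency.
    """
    vf_scale = (1.0 + power_increase) ** (-1.0 / 3.0)
    return (1.0 + ipc_gain) * vf_scale


# Hypothetical feature: +10% IPC for +20% power.
net = iso_power_performance(ipc_gain=0.10, power_increase=0.20)
print(f"net performance at equal power: {net:.3f}x")
```

In this model the feature still nets roughly 3.5% more performance at the same power, because a 20% power increase is paid back by only about a 6% V/F reduction. The argument only holds while there is voltage headroom to give up.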



### Adding Features to Increase Performance Works with V/F Scaling





IPC

Voltage scaling has its limits

- → More power efficient designs have an advantage
- → High power designs get penalized due to higher di/dt, higher temperatures etc.

If we hit V<sub>MIN</sub>, however, the game is over

## How Hard is Improving Existing Processors?

#### Watts/(Spec\*Vdd\*Vdd\*L)



Most of the big-hitter improvements have been heavily mined already




Next generation AMD cores have >> 50% of clocks gated off even for high power code

### **Multi-Core to the Rescue?**



|           | Cache + 1 core | Cache + 2 cores |
|-----------|----------------|-----------------|
| Voltage   | 1              | 0.85            |
| Frequency | 1              | 0.85            |
| Area      | 1              | 2               |
| Power     | 1              | 1               |
| Perf      | 1              | ≈1.7            |
| Perf/Watt | 1              | ≈1.7            |

Sounds like a great story, what's the catch?

### Multi-Core to the Rescue?

Some of the catches:

- What if you're already at V<sub>MIN</sub>? Then frequency must be cut in half to stay within the power limit
- How much parallelizable code is really out there?
- More compute capacity means more IO and memory bandwidth demands ...





### Multi-Core Issues: Amdahl's Law

There is almost always a portion of an application that cannot be parallelized

- This portion becomes a bottleneck as the number of threads is increased
- A typical value is in the range of 10%




Just 10% serial code drops 8 core performance improvement by 41%
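The 41% figure appears to correspond to Amdahl's law evaluated for a 10% serial fraction on 8 cores, compared against an ideal 8x speedup. A quick check, with an illustrative function name:

```python
def amdahl_speedup(serial_fraction, cores):
    """Amdahl's law: speedup when only the parallel part scales with cores."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


speedup = amdahl_speedup(0.10, 8)
shortfall = 1.0 - speedup / 8
print(f"speedup on 8 cores: {speedup:.2f}x, {shortfall:.0%} below the ideal 8x")
```

This prints a speedup of about 4.71x, which is 41% short of the ideal.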



### **Multi-Core Issues: IO Power**

All those extra cores need their own data ...

IO power has been pretty constant for years, at around 20 mW per Gb/s





If we increase IO power accordingly, but hold total chip power constant with V/F scaling, things get worse

Overall performance drops by another 10% or so



## The Transition to Parallel Applications

### Single-threaded Applications

Most of today's applications

Well understood optimization techniques

Advanced development, analysis and debug tools

Conceptually, easy to think about

### **Parallel Applications**

Small number of applications (worked by experts for 10+ yrs)

Awkward development, analysis and debug environments

Parallel programming is hard!

Amdahl's law is still a law...

SW productivity is already in a crisis  $\rightarrow$  *this worsens things!* 

Establishing an <u>appropriate balance</u> is key for managing this important transition

### Other Architectural Directions: Integration



Integration of more system components (e.g. memory controllers, IO) not only improves performance, it reduces power significantly as well

- IO communication overhead drops
- CPU integrated power management can dynamically optimize
- Power efficiency of special function components (e.g. graphics accelerators, network processors) greatly exceeds that of general purpose CPUs

### **System-level Power Consumption**



### **Dual-Core Packages with legacy technology**

- 692 watts for processors (173w each)
- 48 watts for external memory controller

### 95% More Power

### **Dual-Core AMD Opteron™ processors**


- 380 watts for processors (95w each)
- Integrated memory controllers
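The "95% more power" comparison can be reproduced from the figures on this slide. The processor count of four is inferred from 692 W at 173 W each and 380 W at 95 W each; the function name is illustrative.

```python
def system_power(processors, watts_each, external_controller_watts=0):
    """Total power for the processors plus any external memory controller."""
    return processors * watts_each + external_controller_watts


legacy = system_power(4, 173, external_controller_watts=48)  # 692 W + 48 W
integrated = system_power(4, 95)                             # 380 W
extra = legacy / integrated - 1
print(f"legacy: {legacy} W, integrated: {integrated} W, {extra:.0%} more power")
```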






### Other Architectural Directions: Integration

Integrating two processor core designs enables both peak performance and throughput/watt

### **Barriers**?

- Integration of heterogeneous designs is non-trivial
- IP barriers
- Schedule issues with multiple converging components








### Summary (1 of 2)

Silicon process technology is unlikely to be the major engine of processor performance increases in the future

Major circuit related challenges that we've only just started to address lie ahead:

- Design for variation tolerance and mitigation
- Maintaining dynamic voltage headroom within reliability and variation imposed limits
- Adaptive, self-healing techniques are a key direction





### Summary (2 of 2)

Silicon process technology is unlikely to be the major engine of processor performance increases in the future

- CPU architectures are converging on modest pipe length, limited-issue, out-of-order designs
- Multi-core is good, but has limits in the not too distant future
- Heterogeneous integration is a key direction



We're up to the challenge, but it will be a joint effort ...