Harnessing the Power of Parallel Computation on the IBM BlueGene/L to Analyze Complex Digital and RF Systems

Raj Mittra

Electromagnetic Communication Laboratory

**Penn State University** 

The **International Technology Roadmap for Semiconductors (ITRS)** projects the need for package delay accuracy of 1% of the off-chip clock frequency. With on-chip speeds projected at 15 GHz by 2010 and 33 GHz by 2015, the off-chip interconnect is projected to reach 10 Gbps and 30 Gbps, respectively.

### Introduction

#### **Incorporating EM Structures into Circuit Simulation**



**Courtesy of http://www.ansoft.com/hfworkshop03/Weimin\_Sun\_Vitesse.pdf** 

# Introduction (cont'd)








**Eight-Bit Bus Cross-Talk Example**

Tech = 0.18 µm, Vdd = 1.8 V, wire length = 3000 µm

Noise induced on the quiet line?

- Peak noise including L = 1.17 V (Vdd = 1.8 V)
- Peak noise ignoring L = 0.54 V





Figure 7: Package structure, 280 $\mu$m × 680 $\mu$m × 56 $\mu$m, with 56 leads. The package is sandwiched between a top and a bottom Al<sub>2</sub>O<sub>3</sub> ceramic layer ($\varepsilon_r$ = 9.8) of thickness 20 $\mu$m. A dielectric slab of $\varepsilon_r$ = 11.9 and thickness 10 $\mu$m is placed in the space between the leads (96 $\mu$m × 220 $\mu$m).



Fig. 1. A multilayer interconnect with the power distribution grid highlighted; the ground lines are light gray, the power lines are dark gray, and the signal lines are white.



Fig. 3. Current loop with two alternative current return paths. The forward current $I_0$ returns both through return path one, with resistance $R_1$ and inductance $L_1$, and return path two, with resistance $R_2$ and inductance $L_2$. In this structure, $L_1 < L_2$ and $R_1 > R_2$. At low frequencies (a), the path impedance is dominated by the line resistance and the return current is distributed between the two return paths according to the resistance of the lines. Thus, at low frequencies, most of the return current flows through the return path of lower resistance, path two. At very high frequencies (b), however, the path impedance is dominated by the line inductance and the return current is distributed between the two return paths according to the inductance of the lines. Most of the return current flows through the path of lower inductance, path one, minimizing the overall inductance of the circuit.
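
The frequency dependence described in this caption can be checked with a simple current-divider calculation. The sketch below is illustrative only: it treats the two return paths as independent series R-L branches in parallel, neglects mutual inductance, and uses made-up component values (`R1`, `L1`, `R2`, `L2`) chosen only to satisfy $L_1 < L_2$ and $R_1 > R_2$.

```python
import math

def return_current_split(freq_hz, r1, l1, r2, l2):
    """Return (|I1/I0|, |I2/I0|) for two parallel series R-L return paths.

    Assumes no mutual coupling between the paths (a simplification).
    """
    w = 2 * math.pi * freq_hz
    z1 = complex(r1, w * l1)  # impedance of return path one
    z2 = complex(r2, w * l2)  # impedance of return path two
    i1 = z2 / (z1 + z2)       # current divider: path one carries Z2/(Z1+Z2)
    i2 = z1 / (z1 + z2)
    return abs(i1), abs(i2)

# Hypothetical values with L1 < L2 and R1 > R2, as in the caption.
R1, L1 = 2.0, 1e-9   # ohms, henries
R2, L2 = 1.0, 4e-9

for f in (1e3, 1e8, 1e12):
    frac1, frac2 = return_current_split(f, R1, L1, R2, L2)
    print(f"f = {f:8.1e} Hz  path one: {frac1:.3f}  path two: {frac2:.3f}")
```

With these values the low-frequency limit is $R_2/(R_1+R_2) = 1/3$ of the current in path one, and the high-frequency limit is $L_2/(L_1+L_2) = 0.8$, which matches the resistive-to-inductive transition the caption describes.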



Power/ground grid structures under investigation: (a) a noninterdigitated grid, (b) a grid with the power lines interdigitated with the ground lines, and (c) a paired grid, in which the power and ground lines are in close pairs. The power lines are colored gray and the ground lines are colored white.

Fast Flip-chip Power Grid Analysis Via Locality and Grid Shells



#### Fig. 1. Flip-chip die showing C4 bump connections. (from Chiprout)



Fig. 2. M2 voltage contour map of a Pentium® grid within a 4x4 array of C4 bumps.



#### Fig. 3. M4-M5 via Currents.



Fig. 4. Current source partitions with the P5 source partition highlighted, together with its power grid shell, P1-P9.



Fig. 5. Currents of a complete Pentium® microprocessor power grid model with the top 4 metals, obtained in less than 30 minutes.



#### Fig. 1 Mechanical model of the shield and PCB



Geoffrey W. Burr (IBM Almaden Research Center, San Jose, CA 95120), "Numerical modeling for nanophotonics design."




# **EM Noise in PCB**



# **Motivation**

### Problem: Switching noise ....



### How to mitigate the noise beyond 500 MHz

# State of Current Technology



# Coupling to Sensitive Devices in a Multi-Layer Stack up



# Wideband Noise Mitigation in PCBs

#### Concept: cascaded filter design





#### **Spiral Inductors**



#### **Details of the Geometry**

#### • Box Dimension:

- 792150 nm along x; 1450210 nm along y; 159 divisions along x; 500 divisions along y.
- Layer stackup, starting from the bottom, as {thickness, $\varepsilon_r$}: {300 nm, 2.9}; {512 nm, 2.9}; {512 nm, 2.9}; {640 nm, 2.9}; {768 nm, 2.9}; {1152 nm, 2.9}; {1944 nm, 2.9}; {600 mm, 1.0}.
- Frequency: 1 GHz.
- Number of Unknowns: 6940.
- Number of Ports: 4.
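
To get a feel for the numbers above, the short sketch below works out the discretization step along each axis and compares it with the wavelength at 1 GHz. It assumes the "divisions" are uniform cells spanning the full box, and it uses the lossless plane-wave wavelength in the $\varepsilon_r$ = 2.9 dielectric; neither assumption is stated on the slide.

```python
import math

C0 = 299_792_458.0  # speed of light in vacuum, m/s

box_nm = {"x": 792150, "y": 1450210}   # box dimensions from the slide
divisions = {"x": 159, "y": 500}       # divisions from the slide

# Cell size per axis, assuming uniform divisions across the box.
cell_nm = {axis: box_nm[axis] / divisions[axis] for axis in box_nm}

# Total thickness of the seven eps_r = 2.9 layers listed in the stackup.
layers_nm = [300, 512, 512, 640, 768, 1152, 1944]
stack_nm = sum(layers_nm)

# Wavelength in the dielectric at 1 GHz (assumed lossless, non-magnetic).
freq_hz, eps_r = 1e9, 2.9
wavelength_nm = C0 / (freq_hz * math.sqrt(eps_r)) * 1e9

print(f"cell size: {cell_nm['x']:.2f} nm (x), {cell_nm['y']:.2f} nm (y)")
print(f"dielectric stack thickness: {stack_nm} nm")
print(f"cells per wavelength along x: {wavelength_nm / cell_nm['x']:.0f}")
```

The cells come out several orders of magnitude smaller than the wavelength, so the mesh density here is set by the geometry of the spiral, not by the frequency.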

# CBFM REDUCES THE MATRIX SIZE TO 8; CBMOM IS HIGHLY PARALLELIZED

# Parallel Conformal FDTD Solver



### **Conformal FDTD (CFDTD) Solver**



### Computational problems and suggestions for each step of MoM



|   |                             | Conventional MOM/FMM | CBMOM |
|---|-----------------------------|----------------------|-------|
| 1 | Localization of the problem | Yes and no: computes and stores only the near-field interactions, but iterates on the entire solution vector | Localization is achieved by using a windowed incident field |
| 2 | Multiple frequencies        | Treated individually, one at a time | **CBs can be generated up front for a few frequencies and then used over the range.** |
| 3 | Other attributes            | (a) Kernel-dependent<br>(b) Can lead to ill-conditioned matrices, especially near resonance | (a) Not kernel-dependent<br>(b) Handles resonant structures without difficulty because it solves problems without iteration<br>(c) Highly parallelizable |

# A small cluster (including 3 PCs):



- Number of processors: 3
- Computer: Dell Precision Workstation 340
- Processor: Intel Pentium 4 3.06 GHz
- RAM: RDRAM 2 GB, dual channel
- NIC: Intel 1 GHz
- Switch: Dell Power Connect 2508 1 GHz
- Operating system: Redhat Linux 9.0
- Fortran compiler: Intel Fortran 7.0
- C++ compiler: GCC 3.2
- MPI: MPICH 1.2.5

# Large cluster (including 256 processors)



- Number of nodes: 128
- Number of processors: 256
- Computer: Dell PowerEdge 1750
- Processor: Dual Intel Xeon 3.2 GHz, 1 MB Advanced Transfer Cache
- RAM: ECC DDR SDRAM (2x2 GB)
- NIC: Dual embedded Broadcom 10/100/1000 NICs
- Switch: Myricom
- Operating system: Redhat Linux AS 2.1
- Fortran compiler: Intel Fortran 7.0
- C++ compiler: Intel C++
- MPI: MPIGM

# **Architecture of IBM BlueGene/L**



# Finite Difference Time Domain Yee Algorithm
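
As a reminder of what the Yee scheme computes, here is a minimal one-dimensional leapfrog update. It is a sketch and not the CFDTD solver discussed in these slides: the fields are in normalized units (H scaled by the free-space impedance), the grid ends are perfect electric conductors, and the function name `yee_1d`, the grid size, and the Gaussian source parameters are all illustrative assumptions.

```python
import numpy as np

def yee_1d(n_cells=200, n_steps=150, courant=0.5):
    """Run a 1-D Yee leapfrog update and return (ez, hy).

    ez lives on integer grid nodes and hy on the half nodes between them,
    which is the staggering that defines the Yee grid.
    """
    ez = np.zeros(n_cells)
    hy = np.zeros(n_cells - 1)
    src = n_cells // 2
    for n in range(n_steps):
        # Half step: update H from the spatial difference of E.
        hy += courant * (ez[1:] - ez[:-1])
        # Half step: update interior E from the spatial difference of H.
        # The end nodes stay at zero (PEC walls).
        ez[1:-1] += courant * (hy[1:] - hy[:-1])
        # Soft Gaussian source added at the centre node.
        ez[src] += np.exp(-((n - 30) / 10.0) ** 2)
    return ez, hy

ez, hy = yee_1d()
print("peak |Ez| after run:", float(np.max(np.abs(ez))))
```

The Courant number must stay at or below 1 in one dimension for the update to remain stable; 0.5 is used here only as a safe illustrative choice.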



### Use Multiple Processors to simulate a Large Problem



#### UPWARD OF 10 BILLION UNKNOWNS SOLVED ROUTINELY ON PSU CLUSTER; CAN DO MUCH BIGGER PROBLEMS ON THE BLUE GENE

### Information exchange procedure
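
The exchange pattern in a parallel FDTD code can be illustrated without any message-passing library. In the sketch below each "process" is just a pair of NumPy arrays and the exchange is a plain copy of one boundary value per neighbour per half step; a real solver would send these values with MPI. The function name `run_fdtd_1d`, the 1-D slab layout, and the source parameters are assumptions made for illustration and are not taken from the solver described here.

```python
import numpy as np

def run_fdtd_1d(n_cells, n_parts, n_steps, courant=0.5):
    """1-D Yee update on n_parts slabs; returns the reassembled Ez array."""
    chunks = np.array_split(np.arange(n_cells), n_parts)
    starts = [int(c[0]) for c in chunks]
    last = n_parts - 1
    ez = [np.zeros(len(c)) for c in chunks]
    # The last slab owns one fewer H sample (H sits between E nodes).
    hy = [np.zeros(len(c) - (1 if p == last else 0))
          for p, c in enumerate(chunks)]
    src = n_cells // 2
    for n in range(n_steps):
        # Exchange 1: each slab receives the first Ez of its right neighbour.
        ghost_e = [ez[p + 1][0] if p < last else None for p in range(n_parts)]
        for p in range(n_parts):
            e_ext = ez[p] if p == last else np.append(ez[p], ghost_e[p])
            hy[p] += courant * (e_ext[1:] - e_ext[:-1])
        # Exchange 2: each slab receives the last Hy of its left neighbour.
        ghost_h = [hy[p - 1][-1] if p > 0 else None for p in range(n_parts)]
        for p in range(n_parts):
            h_ext = hy[p] if p == 0 else np.insert(hy[p], 0, ghost_h[p])
            lo = 0 if p > 0 else 1                     # global node 0 is PEC
            hi = len(ez[p]) - (1 if p == last else 0)  # global last node is PEC
            ez[p][lo:hi] += courant * (h_ext[1:] - h_ext[:-1])
        # Soft Gaussian source, applied by whichever slab owns the source node.
        for p, c in enumerate(chunks):
            if c[0] <= src <= c[-1]:
                ez[p][src - starts[p]] += np.exp(-((n - 30) / 10.0) ** 2)
    return np.concatenate(ez)

serial = run_fdtd_1d(200, 1, 120)
split = run_fdtd_1d(200, 4, 120)
print("max difference serial vs 4 slabs:", float(np.max(np.abs(serial - split))))
```

Because only one field value crosses each interface per half step, the communication volume scales with the interface area while the computation scales with the subdomain volume, which is why FDTD tends to parallelize well.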



### Subdomain division

Screenshot: CFDTD solver parallel-processing dialog, with fields for the number of processes, the direction, and the processor distribution along the x direction.
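
One simple way to distribute cells along a single direction is to give every process a contiguous slab of nearly equal size. The helper below is a sketch of that idea only; the name `split_cells` and the example sizes (500 cells, 8 processes) are hypothetical, and the actual solver may weight the split differently, for example to account for subgridding regions.

```python
def split_cells(n_cells, n_procs):
    """Return [(start, stop), ...]: contiguous, near-equal slabs along one axis.

    The first (n_cells % n_procs) slabs get one extra cell so that slab
    sizes differ by at most one.
    """
    base, extra = divmod(n_cells, n_procs)
    slabs, start = [], 0
    for rank in range(n_procs):
        stop = start + base + (1 if rank < extra else 0)
        slabs.append((start, stop))
        start = stop
    return slabs

for rank, (start, stop) in enumerate(split_cells(500, 8)):
    print(f"process {rank}: cells [{start}, {stop})  size {stop - start}")
```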

# 1-D parallel processing configuration.

Original problem

1-D domain decomposition



# 2-D parallel processing configuration.

Original problem

#### 2-D domain decomposition



# 3-D parallel processing configuration.
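
In a 3-D configuration each process needs to know which processes sit across its six faces. The sketch below shows one conventional way to map a linear process index to a position in a 3-D process grid and back, and to list the face neighbours. The row-major ordering, the function names, and the 4 x 2 x 2 grid are assumptions for illustration; the slides do not specify the solver's actual numbering.

```python
def rank_to_coords(rank, dims):
    """Convert a linear rank to (ix, iy, iz) in a row-major process grid."""
    _, ny, nz = dims
    ix, rem = divmod(rank, ny * nz)
    iy, iz = divmod(rem, nz)
    return ix, iy, iz

def coords_to_rank(coords, dims):
    """Inverse of rank_to_coords."""
    ix, iy, iz = coords
    _, ny, nz = dims
    return (ix * ny + iy) * nz + iz

def face_neighbours(rank, dims):
    """Map '-x', '+x', '-y', ... to the neighbour rank, or None at the outer boundary."""
    coords = rank_to_coords(rank, dims)
    result = {}
    for axis, name in enumerate("xyz"):
        for step, sign in ((-1, "-"), (1, "+")):
            shifted = list(coords)
            shifted[axis] += step
            inside = 0 <= shifted[axis] < dims[axis]
            result[sign + name] = coords_to_rank(shifted, dims) if inside else None
    return result

dims = (4, 2, 2)  # hypothetical process grid: 16 processes
print(rank_to_coords(5, dims))
print(face_neighbours(5, dims))
```

A process with `None` on a face applies the physical boundary condition there instead of exchanging data.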



# Data exchange configuration for processes along the y-direction



# Overlapping region between regular FDTD and subgridding regions



# Subgridding implementation in the parallel processing



#### **Result collection**





## Efficiency of parallel FDTD solver (Penn State Lion-xm cluster)
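
The slides do not state how the plotted efficiency is defined. A commonly used definition, assumed here, takes $T_p$ as the wall-clock time on $p$ processors and writes the speedup $S$ and efficiency $E$ as

$$
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}.
$$

When the problem is too large to run on one processor, the same idea is often applied relative to the smallest run that fits, on $p_0$ processors:

$$
E(p) = \frac{p_0\,T_{p_0}}{p\,T_p}.
$$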



# Scalability of parallel FDTD solver (Penn State Lion-xm cluster)



## Scalability and Efficiency of parallel FDTD solver (IBM BlueGene/L)



#### **PFDTD NP Scaling Results**



#### **PFDTD Percent Scaling Results**



# Antenna array fed by coaxial cable (continued)



### 100x100 patch antenna array (array size: 48.25 wavelengths)

3-D view





### Serial/Parallel FDTD Mimics TDR



CAN ENHANCE PROBLEM SOLVING CAPABILITY BY ORDERS OF MAGNITUDE IN TERMS OF NUMBER OF UNKNOWNS

# Point (1)





### **IBM BlueGene/L**



#### Code Enhancer: Beowulf on the Go, in a Box. Just Plug into the USB Port and Play



#### **POOR MAN'S BLUE GENE??**