IEEE ©1998 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Return to Table of Contents

A New Course on Superscalar and VLIW Computer Architectures for Real-Time Image and Video Computing

Chris Basoglu, Member, IEEE, and Yongmin Kim, Fellow, IEEE

Abstract - A one-quarter electrical and computer engineering course on advanced microprocessors and real-time image/video computing algorithms is described in this paper. It is primarily aimed at graduate students who have some background in computer architecture, signal and image processing, and experience in low-level programming of microprocessors. The course is designed to teach students not only the advanced microprocessors with instruction-level parallelism, e.g., superscalar and very long instruction word (VLIW) computer architectures, but also how to analyze and efficiently map various image and video computing algorithms on these microprocessors. Emphasis is placed on laboratory and individual project work requiring an in-depth study, implementation, testing and verification, demanding intensive participation from both the students and instructor. It is continually updated and improved to reflect new technologies in this dynamically changing area and to incorporate student feedback. This course has been offered for four quarters at the University of Washington.


I. Introduction

As all forms of media such as film, audio, and video become digital, the need for the acquisition, processing, display and communications of such multimedia data has been spurring rapid technology developments. By taking advantage of these new technologies, many new applications have become possible and existing applications are strengthened with improving cost/performance ratios. Digital video, desktop video teleconferencing, machine vision, medical imaging and digital cameras are several such examples. Image and video computing algorithms and their fast implementations are a few of the core enabling technologies for such applications. Because of the vast market potential, consumer products employing real-time digital video computing have been generating a great deal of excitement among the manufactures and users. The real-time aspect and consumer-level focus of such systems requires high computational power at very low cost and optimal implementations of key algorithms. In prototyping and designing these systems, a programmable approach provides flexibility and adaptability to new applications and changing requirements, which is a definite advantage over specialized hardwired solutions.

There is an ever-increasing need in industry for students who can efficiently implement image and video computing algorithms on commercially available microprocessors. We feel that most computer engineering and image processing courses including ours at the University of Washington do not sufficiently involve the students in actual real-time algorithm implementation. Most courses use high-level image and data processing tools like Khoros and Matlab to illustrate various fundamental concepts in image and video computing [1-2]. While the understanding gained from such high-level tools is valuable, these students lack the training and hands-on experience necessary to implement computationally intensive image/video computing algorithms in real-time systems.

Those courses which do cover the implementation aspects of compute-intensive algorithms typically focus on parallel processing and distributed computing on massively-parallel machines [3-5]. Many people in academia and industry have noted the lack of practical experience afforded to engineering students in this field, and some have gone as far as to make massively-parallel machines available for student laboratories [5]. While these parallel machines have provided the students with useful hands-on experience in parallel programming, the system and maintenance costs are very high, and they are not the machines commonly used in industry in implementing digital image and video systems. Others have used transputer networks [6-8], software simulators [9-10], or "pencil and paper" experiments [11] to provide a parallel programming environment for their students.

Several recent commercially-available general purpose processors and digital signal processors (DSPs) have been designed for the demanding computing needs in processing digital video, images, graphics and audio to enable new consumer-level applications [13]. These new processors employ instruction-level parallelism which includes the superscalar and very long instruction word (VLIW) computer architectures. Instruction-level parallelism allows for multiple CPU operations to be initiated in a single clock cycle. This is done by having multiple execution units on-chip and/or by partitioning the ALU of a particular execution unit into multiple independent pieces (e.g., a 32-bit ALU split into two 16-bit ALUs). Intel Pentium Pro, Hewlett-Packard PA-RISC 8000, Sun Microsystems UltraSPARC, and Texas Instruments TMS320C80 are all recently-developed superscalar/VLIW processors designed to meet the increasing computing requirement due to the proliferation of multimedia. Table 1 lists various characteristics of these advanced microprocessors.

Table 1

Current commercially available superscalar/VLIW microprocessors and DSPs.


Current clock frequency

Num. of operations issued per cycle

Availability of partitioned addition

Num. of multiplies issued per cycle

Die size

Process technology

Intel Pentium Pro

200 MHz




195 mm2

0.35 m

Motorola PowerPC 604

150 MHz




196 mm2

0.5 m


200 MHz




345 mm2

0.5 m

Sun UltraSPARC I

200 MHz




315 mm2

0.5 m

TI TMS320C80

50 MHz




342 mm2

0.6 m

Several new courses have been developed to cover such architectures and processors [8, 14], emphasizing more on the hardware aspects. However, we believe that real-time implementations require a systems approach including software, hardware, and integration. Thus, we have taken a balanced approach in our course that the programmers must understand the underlying hardware which executes their software.

II. Course Structure

The course is divided into two parts: three hours/week lectures and open laboratory where the students work on the laboratory exercises, homework problems and an individual project on the average of 10 to 20 hours a week. The laboratory exercises are designed to exemplify the topics covered in class. An individual project is a full implementation of an image or video computing algorithm with medium complexity on a real multimedia system. We require that the student's project should execute as efficiently as possible on the given hardware.

  1. Overview of image and video computing: current status, challenges and opportunities.
  2. Image and video computing algorithms.
  3. Instruction-level parallelism.
  4. Algorithm development for real-time implementation on programmable processors.
  5. Data flow.
  6. Multitasking operating system issues for multiprocessors.
  7. RISC processors using the TMS320C80 MP as an example.
  8. Superscalar processors using the Power PC 604 as an example.
  9. VLIW processors using the TMS320C80 ADSP as an example.

Fig. 1. The course syllabus.

The syllabus of the course most recently offered is in Fig. 1 which lists the major topics covered. The course starts with a brief overview of the current status and future trends of image and video computing. The computational power required in systems capable of supporting real-time image and video computing is analyzed. Previously, we at the University of Washington have developed two systems for real-time multimedia processing, the MediaStation 5000 (MS5000) and the Programmable Ultrasound Image Processor (PUIP) [15-16]. These systems are used as real-life examples of the topics taught in class and are target platforms for the students' projects.

The general theory and applications of image and video computing algorithms are then covered. This includes the algorithms and various tasks required for JPEG, H.320, and MPEG encoding and decoding [17]. A prerequisite course on the fundamentals of image processing covers these aspects more in detail. These algorithms are then used as examples to illustrate their efficient implementations on the different processor types, e.g., students are taught on how take advantage of the data independence within each algorithm by utilizing the superscalar or VLIW processor's ability to perform multiple operations concurrently. Furthermore, the problem of implementing these algorithms via fixed-point arithmetic and our solution to this problem are also covered.

Fig. 2. A Cooley-Tukey FFT butterfly flowgraph and its computational and I/O requirements.

One of these algorithms covered is the Cooley-Tukey FFT butterfly as shown in Fig. 2. On a processor with no instruction-level parallelism, an FFT butterfly takes 15 cycles because there are 15 basic operations in the flowgraph. The successive butterflies are independent of each other. Thus, on a processor with instruction-level parallelism, independent butterflies can be computed in parallel if the execution order of operations in the butterfly is changed. For example, a three-issue superscalar/VLIW processor with the capability of simultaneously issuing one multiply, one add/subtract and one load/store could perform an add/subtract operation for the current butterfly, a multiply operation for the previous butterfly, and a load operation for the next butterfly. If these operations are repeated in a loop, three consecutive butterflies in the same decomposition stage can be computed in parallel. We have found that these kinds of examples are effective in teaching the instruction-level parallelism and parallel programming, especially when students see significant speed-ups in those algorithms that they are familiar with.

Next, the hardware aspects of the modern superscalar and VLIW computer architectures are covered, and the advantages and disadvantages they confer on real-time algorithm implementations are discussed. Concepts such as instruction pipelining, instruction buffering, out-of-order dispatch, out-of-order execution, reservation stations, dynamic branch prediction and register renaming are introduced with examples. Some of the issues heavily impacting the system performance are then discussed, e.g., a discussion on data flows when on-chip instruction cache, data cache, and/or DMA-controlled SRAM is present. Another example is a review of various external memory types (SRAM, DRAM, VRAM, EDO, SDRAM, and RAMBUS). Also, multitasking operating systems and their use in task management, communications, and synchronization in implementing threaded applications, messages, software ports, and semaphores are presented from the user's point of view.

The Power PC 604 microprocessor is used as a real world example of a superscalar microprocessor [18]. It contains two independent single-cycle integer units, one complex multi-cycle integer unit, a load/store unit, a branch unit, and a floating-point unit. The two single-cycle units are responsible for the execution of all integer operations with the exception of multiplications and divisions which are handled in the complex multi-cycle integer unit. The PowerPC 604 can dispatch up to four instructions per clock period to four out of the six on-chip execution units. To explain how algorithms can be efficiently implemented on such an architecture, all the stages in the Power PC's 6-stage instruction pipeline, Fetch, Decode, Dispatch, Execute, Completion, and Writeback, are discussed in detail. Sample code segments from several image/video computing algorithms are used to show efficient coding on this architecture.

Fig. 3. The block diagram of the TMS320C80 multiprocessor.

The Texas Instruments TMS320C80 is used as an example of both a RISC processor and a VLIW processor. The TMS320C80 a single-chip multiprocessor designed for high-speed image processing and real-time multimedia applications [19]. For example, the processor is capable of performing real-time MPEG-1 compression and decompression [15]. Fig. 3 shows the block diagram of the TMS320C80. The TMS320C80 contains a RISC processor called the Master Processor (MP), four Advanced Digital Signal Processors (ADSPs; each ADSP is a DSP with a VLIW architecture), and a programmable DMA controller called the Transfer Controller (TC). The MP has a floating-point unit which can issue floating-point instructions on every cycle. Each ADSP has a 16-bit fixed-point multiplier, a 3-input 32-bit ALU, a branch unit and two load/store units. The programmable DMA controller allows various types of complex address calculations and data transfers.

All five on-chip processors in the TMS320C80 are capable of executing multiple operations per cycle. Each ADSP can execute one 16-bit multiplication (that can be partitioned into two 8-bit multiplications), one 32-bit arithmetic and logical operation (partitionable into two 16-bit or four 8-bit operations), and two load/store operations in the same cycle. All fixed-point operations, including multiplications, are performed in a single cycle. The MP is capable of executing a floating-point addition and multiplication as well as a load/store operation in parallel. Assuming that the MP executes two operations and each ADSP executes four parallel operations, a total of 18 concurrent operations in the TMS320C80 per cycle can be sustained. For example, in executing the FFT butterfly in Fig. 2, it takes each ADSP only 6 cycles to perform the 15 basic operations. Thus, the TMS320C80 can compute an FFT butterfly in 1.5 cycles. For the two dimensional FFT of a 512512 complex input image, the execution time of the butterfly tight loop on the TMS320C80 running at 50 MHz can be computed as:

(1.5 cycles/butterfly) * (256 * 9 butterflies/row) * (2 * 512 rows) * (20 ns/cycle) = 70.8 ms

(Eq. 1)

The actual measured execution time of the 2-D FFT for a 512512 image is 75.0 ms, so the overhead is low at 5.9%.

Because the TMS320C80's DMA controller is powerful and can be programmed very flexibly, the data transfers can be matched to the needs of a specific algorithm. In this way, students can experiment with different data flows and select the ones which provide the best performance for their algorithms. This additional burden to the programmer on data flow management, however, also makes the low-level programming of the TMS320C80 challenging. The University of Washington Image Computing Library (UWICL) with over 150 image processing functions which was developed for use in TMS320C80-based systems is introduced and made available to the students as examples of efficiently implemented algorithms [20]. It includes low-level functions ranging from convolution and morphology to wavelets, FFT, arithmetic coding and warping. The UWICL's highly modularized design and good documentation make its use straightforward. For example, in developing a high-level algorithm or application, students can use the existing routines from this library and implement only those modules that are not available in the library.

III. Laboratory

Five predesigned laboratories are assigned to introduce actual algorithms and application implementations and to provide hands-on experience to the students. For software development, the Texas Instruments TMS320C80 compiler, assembler, simulator and debugger are made available to the students. The debugger interface allows the students to inspect the contents of all the registers and memory locations (on-chip and off-chip) on a cycle-by-cycle basis. For software verification, Khoros and Matlab are used. Two MS5000 systems allow the students to run their programs on real hardware. Because the MS5000 is a real high-performance multimedia system incorporating various design trade-offs to balance performance and cost, the students are required to understand these trade-offs (e.g., external memory types and bus widths used) and take them into consideration when working on their assignments.

The first assignment is intended to familiarize the students with various superscalar and VLIW processors on the market today. The students are given a list of processors and pointers to their Internet sites where detailed technical information can be found. Each student must choose a processor and write a report on the processor. The student must then report their findings to the class in a short talk. During the presentations, the instructor tries to point out both the common and unique architectural features of the processors.

The second assignment is on the development environment in the laboratory. The students must compile and assemble an existing routine, execute it on both the simulator and on the hardware, and then verify the correctness of the routine. This verification has to be performed by developing and running the functionally-equivalent routine in Khoros, Matlab and C, and comparing the outputs of these three environments with the output of the TMS320C80 via another C program to compute the mean absolute difference and mean square error.

The third assignment handles data flows. The students have to implement a jumble game. The game consists of dividing an image into a series of small rectangles which must be jumbled randomly. One rectangle is chosen to be left out, and is represented by painting it black. The game is solved by moving the rectangles around until the original image is regenerated. An example jumbled output from this assignment is shown in Fig. 4. This assignment is designed to teach the students how to efficiently input and output two-dimensional data and how to build in C and assembly language an application that responds interactively to the user inputs.

Fig. 4. An output from the jumble game assignment.

The students have to build the same jumble game as before, but now each rectangle can be rotated by 90 degrees as well as being moved. In rotating a rectangle, they are required to divide the rectangle into four horizontal or vertical slices so that the four ADSPs of the TMS320C80 can process one slice each in parallel. The processing on each ADSP must be made as parallel as possible by utilizing the instruction-level parallelism. Furthermore, each slice must be divided into small sub-slices so that double buffering can be used to reduce the overall execution time by hiding the I/O time behind the processing time.

The final assignment is devoted to partitioned arithmetic operations. In implementing a function for image thresholding, the students must use partitioned arithmetic so that four 8-bit pixels can be processed in parallel by each ADSP as shown in Fig. 5. Four pixels at a time are subtracted from the threshold value. Four 1-bit flags are used to store whether or not a borrow condition has occurred in each 8-bit subtraction. These flags are then used to output either 0 or 255. Since four output values are generated every two instruction cycles per ADSP, eight pixels can be processed per cycle with all four ADSPs working in parallel.

Fig. 5. The use of partitioned arithmetic to implement image thresholding.

IV. Projects

This course is culminated by the projects on image and video computing algorithm implementations. Each student works individually on his/her own project. In the first few weeks of the quarter, each student is asked to identify a suitable image or video computing algorithm to implement and obtain an approval from the instructor. Each student is expected to do independent research on an interesting and appropriate algorithm. Neither existing UWICL library functions nor previous student projects are allowed. Those students needing help in selecting a project are encouraged to interact with the instructor.

A written project proposal of about five pages is required from each student during the fifth week. The proposal must cover the theory behind the algorithm, how the algorithm will be mapped efficiently on the TMS320C80, the expected performance, and the proposed schedule. Each student makes a 15 minute presentation on the proposal to the class. The students work on their projects until the last day of final exam (eleventh) week.

The projects to date have been quite extensive in scope and variety. Some became the subject of further research [21]. A few sample projects are as follows:

V. Discussion and Conclusions

Because of the number of hours the students need to spend in the laboratory and the limited laboratory resources, the course enrollment has been limited to a maximum of 12 students per quarter. We found that it is important that the students are adequately prepared for the course. Thus, we ask prospective students to apply for the course. They are screened based on their prerequisite background and their likelihood of success in this course before they are allowed to register for the course.

Students get excited towards the end of the five laboratories after obtaining some hands-on experience and having seen even their novice code running much faster than high-end workstations and PCs. After completing the course, quite a few students have become attracted to research on some of the topics discussed in the course. At the same time, we have found that the industry is very eager to recruit these students after their course completion. This is because this course has bridged the gap between advanced microprocessors and image/video computing algorithms and their real-time implementations by specifically teaching the students how to analyze and optimally implement various image and video computing algorithms on processors with instruction-level parallelism, which can be otherwise very challenging. Many pieces of equipment and software development tools were donated by the industry. Most students after successfully passing the course commented that while the course may have been their most "intense" academic experience to date, the unique knowledge, experience and insight they have gained from the course were extremely valuable, and their efforts on the course were worthwhile.


[1] G. W. Donohoe and P. F. Valdez, "Teaching digital image processing with Khoros," IEEE Transactions on Education, vol. 39, pp. 137-142, 1996.
[2] J. F. Arnold and M. C. Cavenor, "A practical course in digital video communications based on MATLAB," IEEE Transactions on Education, vol. 39, pp. 127-136, 1996.
[3] W. E. Toll, "Decision points in the introduction of parallel processing into the undergraduate curriculum," SIGCSE Bulletin, vol. 27, no. 1, pp. 136-140, 1995.
[4] C. H. Nevison, "Parallel computing in the undergraduate curriculum," Computer, vol. 28, no. 12, pp. 51-56, 1995.
[5] R. J. Duckworth, "Introducing parallel processing concepts using the MASPAR MP-1 computer," SIGCSE Bulletin, vol. 26, no. 1, pp. 353-356, 1994.
[6] N. C. Schaller and A. T. Kitchen, "Experiences in teaching parallel computing - five years later," SIGCSE Bulletin, vol. 27, no. 3, pp. 15-20, 1995.
[7] F. C. Berry, "An undergraduate parallel processing laboratory," IEEE Transactions on Education, vol. 38, pp. 306-311, 1995.
[8] V. L. Narasimhan, "A new course on supercomputers and parallel architectures," IEEE Transactions on Education, vol. 38, pp. 340-345, 1995.
[9] B. S. Elenogen, "Parallel and distributed algorithms laboratory assignments in Joyce/Linda," SIGCSE Bulletin, vol. 28, no. 1, pp. 14-17, 1996.
[10] E. A. Lee, "Computing and signal processing: An experimental multidisciplinary course," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 6, pp. 45-48, 1994.
[11] L. Yang and L. Jin, "Integrating parallel algorithms with parallel machine models," SIGCSE Bulletin, vol. 27, no. 1. pp. 131-135, 1995.
[12] C. Basoglu, W. Lee, and Y. Kim, "An efficient FFT algorithm for superscalar and VLIW processor architectures," Real-Time Imaging, submitted, Feb. 1996
[13] P. Pirsch, N. Demassieux, and W. Gehrke, "VLSI architectures for video compression -- A survey," Proceedings of the IEEE, vol. 83, pp. 220-245, 1995.
[14] W. E. Mattis, "An advanced microprocessor course with a design component," SIGCSE Bulletin, vol. 27, no. 4, pp. 60-64, 1995.
[15] W. Lee, Y. Kim, R. J. Gove, and C. J. Read, "MediaStation 5000: Integrating video and audio," IEEE Multimedia, vol. 1, no. 2, pp. 50-61, 1994.
[16] Y. Kim, J. Kim, C. Basoglu, and T.C. Winter, "Programmable ultrasound imaging using multimedia technologies: a next generation ultrasound machine," IEEE Transactions on Information Technology and Biomedicine, vol. 1, pp. 19-29, 1997.
[17] D. LeGall, "MPEG: A video compression standard for multimedia applications," Communications of the ACM, vol. 34, no. 4, pp. 46-58, 1991.
[18] S. P. Song, M. Denman, and J. Chang, "The PowerPC 604 RISC microprocessor," IEEE Micro, vol. 14, no. 5, pp. 8-17, 1994.
[19] K. Guttag, R. J. Gove, and R. J. Van-Aken, "A single chip multiprocessor for multimedia: The MVP," IEEE Computer Graphics and Applications, vol. 12, no. 6, pp. 53-64, 1992.
[20] J. Kim and Y. Kim, "UWICL: A multi-layered parallel image computing library for single-chip multiprocessor-based time-critical systems," Real-Time Imaging, vol. 2, pp. 187-199, 1996.
[21] S. D. Pathak, Y. Kim, and J. Kim, "Efficient implementation of facet models on a multimedia system," Optical Engineering, vol. 35, pp. 1739-1745, 1996.
[22] P. Mattson, C. Basoglu, and Y. Kim, "Interactive image morphing on a single-chip microprocessor using a multilayered parallel image computing library," Real-Time Imaging, in press, 1998.

Author Contact Information

Chris Basoglu
Equator Technologies, Inc.
520 Pike Street, Suite 900
Seattle, WA 98101
Phone: (206) 812-1290, Ext. 210
Fax: (206) 812-1285

Yongmin Kim
Department of Electrical Engineering, Box 352500
University of Washington
Seattle, WA 98195-2500
Phone: (206) 685-2271
Fax: (206) 543-3842

Author Biographies

Yongmin Kim (S79-M82-SM87-F96) is a Professor of Electrical Engineering and an Adjunct Professor of Bioengineering, Radiology, and Computer Science and Engineering at the University of Washington. His research interests are in algorithms and systems for multimedia, image processing, computer graphics, medical imaging, high-performance programmable processor architecture, and modeling and simulation. Dr. Kim holds a B.S. in electronics engineering from Seoul National University, and an M.S. and a Ph.D. in electrical and computer engineering from the University of Wisconsin. He received the Early Career Achievement Award from the IEEE Engineering in Medicine and Biology Society in 1988.

He is a member of the Editorial Board of the IEEE Transactions on Biomedical Engineering, the IEEE Transactions on Information Technology in Biomedicine, the IEEE Press series in Biomedical Engineering, the Journal of Multimedia Tools and Applications, Telemedicine Journal, and Annual Reviews of Biomedical Engineering. His group has filed 29 invention disclosures and 16 patents, and 17 commercial licenses have been signed. He is a Fellow of the IEEE and a Fellow of the American Institute for Medical & Biological Engineering.

Chris Basoglu was born in Istanbul Turkey on Jan 8, 1971. He received a B.S. degree in electrical engineering at the University of Washington, Seattle, WA. He completed his Ph.D. degree in electrical engineering at the University of Washington, Seattle, WA under the direction of professor Yongmin Kim in September 1997.

His research interests include optimal algorithm mapping to processor architectures supporting various types of parallelism, ultrasound processing algorithm design, and DSP algorithm design. He has taught both University graduate students and industry professionals image processing algorithms, modern microprocessor architectures and on methods to optimally map performance critical algorithms.