# DSP platform for emerging telecommunication and multimedia

### July 2008 to June 2011 Dake Liu, Christoph Kessler, Robert Forchheimer, Erik Larsson, Zebo Peng

### ePUMA: embedded Parallel Computing Platform with Unique Memory Access





# Current running project

- The title
  - DSP platforms for future telecommunication and multimedia
- The motivation
  - There is no DSP platform in Sweden and no good one in Europe
  - The platforms available from the USA have too high power consumption and silicon cost
- The goal
  - To design and implement a high-performance, low-cost parallel DSP processor, together with its programming method and tools, for communication and multimedia
- The scope of research
  - Instruction set, parallel architecture, programming tools, programming methodology, and application demos on the platform
- The scope of applications
  - LTE radio base station, video CODEC, 3D video game, future handset, broadband cable terminal
- The participants
  - ISY: DA, CommSys, InfoCoding; IDA: PELAB, ESLAB



2011/5/11





Examples of predictable real-time DSP computing



# Classify ASIP DSPs

- Ultra low power (ASIP DSP): ~100 MOPS, ~1 mW; used in medical, measurement, IoT, toys, and low-cost home electronics.
- Low power and high performance (general DSP): ~1 GOPS, ~100 mW; used in low-cost mobile phones and other handsets.
- Available high end (general DSP): ~10 GOPS, less than 4 W; used in high-end mobile phones, home multimedia, broadband terminals, communication infrastructure, and medical equipment.
- Ultra high end (ASIP DSP): ~100 GOPS, less than 4 W, no special cooling; a general DSP is not good enough. More requirements come from the markets. See the table on the next page.





# **Market of ultra highend DSP**

| Just a few examples                  | 2010 volume | 2010 GOPS | 20… volume | 20… GOPS | LoP (W) |
|--------------------------------------|--------|------|--------|------|-----|
| Baseband for handset, mobile phone   | >10000 | ~25  | >20000 | ~50  | 0.3 |
| Handset, mobile phone Multimedia DSP | >10000 | ~5   | >10000 | ~10  | 0.5 |
| Baseband for broadband terminals     | >3000  | ~20  | >6000  | ~100 | 1   |
| Baseband for base stations           | ~1000  | ~30  | ~2000  | ~140 | 4   |
| DSP for different gateways           | ~100   | ~20  | ~200   | ~100 | 4   |
| HDTV decoder                         | ~500   | ~5   | ~1000  | ~20  | 1   |
| HDTV encoder                         | ~100   | ~10  | ~500   | ~40  | 1   |

- Volume: Million chips per year
- LoP: Limit of Power consumption





by Dake Liu: dake@isy.liu.se

### Three working groups in our project



# The focus of the project

- Focusing on demonstrating the advantages of the ePUMA SIMD architecture:
  - Parallel memory access with configurable permutation hardware for advanced addressing, which separates data access from computation.
  - Streaming program flow control is separated from kernel computing.
  - Parallel execution overheads can be mostly hidden by decoupling data access and control from parallel computing.
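The effect of decoupling data access from computing can be illustrated with a simple cycle-count model. This is only a sketch: the function name `total_cycles` and all numbers are hypothetical and are not measurements from ePUMA. It compares a coupled schedule (load a block, then compute it) with a double-buffered schedule in which the next block is loaded while the current one is computed.

```python
def total_cycles(n_blocks, load_cycles, compute_cycles, overlapped):
    """Cycle count for processing n_blocks equally sized data blocks.

    overlapped=False: every block is loaded and then computed (coupled).
    overlapped=True:  the load of block i+1 runs while block i is computed
                      (double buffering), so only the first load and the
                      last computation are not hidden behind the other stage.
    """
    if n_blocks <= 0:
        return 0
    if not overlapped:
        return n_blocks * (load_cycles + compute_cycles)
    steady_state = (n_blocks - 1) * max(load_cycles, compute_cycles)
    return load_cycles + steady_state + compute_cycles


# Hypothetical workload: 100 blocks, 40 cycles to load, 64 cycles to compute.
coupled = total_cycles(100, 40, 64, overlapped=False)
decoupled = total_cycles(100, 40, 64, overlapped=True)
print(coupled, decoupled)  # 10400 6440
```

In this model the decoupled schedule approaches the pure computing time (6400 cycles here) as long as loading a block is not slower than computing it.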







### Architectures for stream/kernel

- Deep understanding of function coverage, performance requirements, and data features of radio baseband. Application examples include TD-LTE and FD-LTE for 4G mobile, WiMAX 802.16e, 802.11n terminals for WLAN, and broadcasting terminals.
- Also function coverage, performance requirements, and data features of multimedia, surveillance, various kinds of image processing, and 3D video games.
- Also functions and requirements of the uplink of an LTE base station following R8 and R10 (proposal).





# ePUMA Platform

- Embedded Parallel DSP computing platform with Unique Memory Access
- Each SIMD module contains 16 data level parallel datapaths.
- Up to 8 SIMD modules and a master can be integrated in an ePUMA cluster
- Up to 4 ePUMA clusters can be integrated on one chip as an embedded computing platform
- Extra accelerators can be integrated and implemented using our NoGAP
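The scaling limits listed above multiply out as follows. This is a small sketch; the constant and function names are illustrative and not part of any ePUMA tool.

```python
# Upper bounds taken from the platform description above.
LANES_PER_SIMD = 16        # data-level-parallel datapaths per SIMD module
MAX_SIMD_PER_CLUSTER = 8   # SIMD modules per cluster (plus one master)
MAX_CLUSTERS_PER_CHIP = 4  # clusters per chip


def max_lanes(clusters=MAX_CLUSTERS_PER_CHIP,
              simd_per_cluster=MAX_SIMD_PER_CLUSTER):
    """Total number of SIMD lanes for a given (hypothetical) configuration."""
    if not 1 <= clusters <= MAX_CLUSTERS_PER_CHIP:
        raise ValueError("clusters out of range")
    if not 1 <= simd_per_cluster <= MAX_SIMD_PER_CLUSTER:
        raise ValueError("simd_per_cluster out of range")
    return clusters * simd_per_cluster * LANES_PER_SIMD


print(max_lanes())            # 512 lanes on a fully populated chip
print(max_lanes(clusters=1))  # 128 lanes in one full cluster
```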



### Status:

- Instruction set decision: 100%!
- SIMD simulator: 100%!
- OCN simulator: 80%.
- P-method: 30%
- Master: works!
- DMA: works!
- Off-chip memory subsystem: not yet.
- Test chip: No


## ePUMA structure in a cluster




### Datapath in an ePUMA module

- Up to 4 complex multiplications per cycle
- 2 radix-2 FFT butterflies (BF) in parallel per cycle
- Up to 16 taps of FIR per cycle
- Can run 2 8-tap FIRs or 4 4-tap FIRs in parallel per cycle

Figure: datapath pipeline in four stages, with operand formatting, 16 multipliers (MUL), ALU 1, ALU 2 and logic, accumulators (ACR), scale, round, saturation (sat) and flags units, followed by the writeback engine.
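The figure "up to 4 complex multiplications per cycle" is consistent with 16 real multipliers, because one complex multiplication needs four real multiplications. The sketch below models this in plain Python; the lane assignment and the function name `complex_mul_4way` are assumptions made for illustration, not the documented ePUMA operand formatting.

```python
def complex_mul_4way(a, b):
    """Multiply 4 complex pairs using exactly 16 real multiplications.

    a, b: lists of 4 (re, im) tuples. Returns a list of 4 (re, im) tuples.
    """
    assert len(a) == len(b) == 4
    # Operand formatting (assumed layout): every lane gets one real product.
    lanes_x, lanes_y = [], []
    for (ar, ai), (br, bi) in zip(a, b):
        lanes_x += [ar, ai, ar, ai]
        lanes_y += [br, bi, bi, br]
    products = [x * y for x, y in zip(lanes_x, lanes_y)]  # 16 multiplications
    assert len(products) == 16
    # ALU stage: combine the four partial products of each complex pair.
    out = []
    for k in range(4):
        rr, ii, ri, ir = products[4 * k:4 * k + 4]
        out.append((rr - ii, ri + ir))
    return out


a = [(1, 2), (3, -4), (0, 5), (-2, -3)]
b = [(5, 6), (7, 8), (1, 1), (2, -1)]
print(complex_mul_4way(a, b))
```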


# ePUMA programming method

- Kernel based programming model
- Separating data access from computation
- Hide control and data access overheads







# ePUMA programming toolchain

© Copyright of Computer Engineering ISY, LiU

- Divide a kernel into an "algorithm kernel" and a "data access kernel" to expose opportunities for minimizing data access overheads.
- "Linköping Kernel Collection library".
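The split into a data access kernel and an algorithm kernel can be sketched with a FIR filter. This is an illustration in plain Python, not code from the ePUMA toolchain; the function names `data_access_kernel`, `algorithm_kernel` and `run_fir` are hypothetical. The data access kernel only produces addresses, and the algorithm kernel only consumes operand vectors, so each part can be optimized or scheduled on its own.

```python
def data_access_kernel(n_samples, n_taps):
    """Yield, for each output sample, the input addresses to fetch."""
    for n in range(n_taps - 1, n_samples):
        yield [n - k for k in range(n_taps)]  # sliding window, newest first


def algorithm_kernel(vectors, coeffs):
    """Pure computation: one dot product per incoming operand vector."""
    for vec in vectors:
        yield sum(c * x for c, x in zip(coeffs, vec))


def run_fir(samples, coeffs):
    addresses = data_access_kernel(len(samples), len(coeffs))
    # Stand-in for the memory subsystem: gather operands by address.
    vectors = ([samples[a] for a in addr] for addr in addresses)
    return list(algorithm_kernel(vectors, coeffs))


print(run_fir(list(range(8)), [1, 2, 3]))  # [4, 10, 16, 22, 28, 34]
```

Changing the access pattern (for example to a strided or permuted one) only touches `data_access_kernel`; the computation stays unchanged.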



### Status:




- Kernel identify and replace: feasibility study finished, prototype implemented. started.
- SIMD-SIMT compiler, code insertion: 90% of feasibility study finished
- Master code compiler: V1 works

Dake Liu, ePUMA



## Parallel programming: method and tools

### Advanced address code optimization for SIMD

- Case studies
  - DCT, non-power-of-2 DFT, R2/R4 FFT, small matrix, filters with different sizes, nonlinear filters, Sobel operators, functions, MAX/MIN...
- A deeper understanding of ePUMA's SIMD core
  - how it differs from other currently available SIMD architectures such as Cell SPE, and
  - what the main issues and trade-offs are in optimized code generation for ePUMA.
- Goal: automate the construction of SIMD permutation vectors in an optimized way.
  - We implemented a permutation generator and optimizer, which can be used for coding automation.
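What generating and optimizing permutation vectors can look like is sketched below for a 16-element vector. This is not the project's generator; the function names are hypothetical, and fusing two consecutive permutations into one is just one possible optimization. A permutation vector `perm` is read as "destination element `i` takes source element `perm[i]`".

```python
def transpose_permutation(rows, cols):
    """Permutation that transposes a row-major rows x cols matrix."""
    return [(i % rows) * cols + (i // rows) for i in range(rows * cols)]


def bit_reversal_permutation(n):
    """Bit-reversed index order, as used by radix-2 FFTs (n a power of 2)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    bits = n.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]


def apply_permutation(vec, perm):
    return [vec[p] for p in perm]


def fuse(first, second):
    """Single permutation equivalent to applying `first`, then `second`."""
    return [first[j] for j in second]


t = transpose_permutation(4, 4)
r = bit_reversal_permutation(16)
v = list(range(100, 116))
two_steps = apply_permutation(apply_permutation(v, t), r)
one_step = apply_permutation(v, fuse(t, r))
print(two_steps == one_step)  # True
```

Fusing saves one pass through the permutation hardware whenever two data reorderings follow each other without computation in between.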





# ePUMA early Benchmark

| Benchmark of a SIMD module | Limit of the 16-way SIMD | ePUMA | Efficiency |
|----------------------------|--------------------------|-------|------------|
| 64x64 complex matrix mul   | 65536                    | 73737 | 89%        |
| 1k FFT                     | 1280                     | 1710  | 75%        |
| 4k FFT                     | 6144                     | 7635  | 80%        |
| 8k 16-tap real FIR         | 8192                     | 8212  | 99%        |
| 2k 48-tap complex FIR      | 24576                    | 26679 | 92%        |
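The efficiency column can be recomputed as limit divided by the ePUMA figure. The slide gives no unit; reading both columns as clock cycles is an assumption. The limit formulas below are also a hypothetical reconstruction (4 complex multiply-accumulates, 16 real multiply-accumulates, or one radix-4 butterfly per cycle, with 1k = 1024 and so on) that happens to reproduce the limit column.

```python
# (limit of the 16-way SIMD, ePUMA), copied from the benchmark table.
BENCHMARKS = {
    "64x64 complex matrix mul": (65536, 73737),
    "1k FFT": (1280, 1710),
    "4k FFT": (6144, 7635),
    "8k 16-tap real FIR": (8192, 8212),
    "2k 48-tap complex FIR": (24576, 26679),
}


def efficiency(limit, measured):
    return limit / measured


# Assumed derivation of the limits (not stated in the source).
def matmul_limit(n):
    return n ** 3 // 4                  # 4 complex MACs per cycle


def radix4_fft_limit(n):
    stages = (n.bit_length() - 1) // 2  # log4(n) for n a power of 4
    return (n // 4) * stages            # one radix-4 butterfly per cycle


def real_fir_limit(samples, taps):
    return samples * taps // 16         # 16 real MACs per cycle


def complex_fir_limit(samples, taps):
    return samples * taps // 4          # 4 complex MACs per cycle


for name, (limit, measured) in BENCHMARKS.items():
    print(f"{name:26s} {100 * efficiency(limit, measured):5.1f}%")
```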

- Uplink channel estimation in an LTE base station: 2Ra, 1200 subcarriers, 2.43 MHz
- Uplink channel equalization in an LTE base station: 2Ra, 1Ta, 1200 subcarriers, 35 MHz on a single SIMD module
- HDTV encoder: 1920x1080 30 fps (not including CAVLC), 140 MHz
- HDTV decoder: H.264 1920x1080 30 fps, 60 MHz on ePUMA with one cluster
- Broadband terminal: xDSLx, HomePlug, DOCSIS3.0, ....
- 3D video game: 480×320 frame size on one ePUMA cluster, 2x overdraw, 80 MHz
- Available accelerators for....




### Reviewed papers published in 2009

#### Seven Journal papers

- D Liu, J Sohl, Jian Wang, Parallel computing and its architecture based on data access separated kernels Journal of ISI IJERTCS, January-March, 2010.
- D Liu, A Nilsson, D Wu, J Eilert, and E Tell, Bridging dream and reality: Programmable baseband processor for SDR, IEEE Comm magazine, pp 134-140, September 2009
- A Nilsson, E Tell and D Liu, 11mm2, 70 mW Fully Programmable Baseband Processor for Mobile WiMax and DVB-T/H in 0.12um CMOS, IEEE JSSC, pp 90-97, January 2009.
- Rizwan Asghar, Di Wu, Ali Saeed, Yulin Huang, Dake Liu, Imp of a R-4, Parallel Turbo Decoder and Enabling the Multi-Std Support, J of Signal Processing Systems, 2010.
- Di Wu, J Eilert, R Asghar, D Liu, A Nilsson, E Tell and E Alfredsson, System Arch for 3GPP-LTE Modem using a Programmable Baseband Processor, Int J of Embedded and Real-Time Communication Systems (IJERTCS), 2010
- R Asghar, D Wu, J Eilert, D Liu, Mem Conflict Analysis and Impl of a Re-config Interleaver Arch Supporting Unified Parallel Turbo Dec Springer, J of Signal Processing Systems
- Di Wu, Johan Eilert, Dake Liu, Imp of a High-Speed MIMO Soft-Output Symbol Detector for SDR, J of Signal Processing Systems, Springer, New York, 2009

#### Plus 31 Conference papers

- P Karlström, D Liu, NoGAP a Micro Arch Construction Framework SAMOS IX: Int Symp on Systems, Architectures, MOdeling and Simulation, Samos, Greece, July 2009
- W Zhou, Per Karlström, and Dake Liu, NoGAPCL: A flexible common language for processor hardware description, DDECS 2010, Vienna Austria. April 2010.
- Per Karlström, Wenbiao Zhou, Dake Liu, Operation classification for control path synthetization with NoGAP, ITNG 2010, April, Las Vegas, USA
- J Wang, Olof Kraigher, Joar Sohl, Dake Liu, ePUMA: a Novel Embedded Parallel DSP Platform for Predictable Computing, ICIEE, Shanghai, June 2010
- J Wang, J Sohl, O Kraigher, D Liu ePUMA a novel embedded parallel DSP platform for predictable computing Int Conf on Infor and Elec Eng, Shanghai, June 2010
- W Zhou, P Karlström, D Liu A Flex Common Language for Processor HW Description IEEE Int Sym on Design and Diagnostics of Electronic C & Sys, Apl 4-6 2010, Vienna.
- Per Karlström, Faisal Akhlaq, Sumathi Loganathan, Wenbiao Zhou, Dake Liu, Cycle Accurate Simulator Generator for NoGAP, 2010, PRIME Asia 2010
- Per Karlström, Wenbiao Zhou, Ching-han Wang, Dake Liu, Design of PIONEER: a Case Study using NoGAP, 2010, PRIME Asia 2010
- Per Karlström, Wenbiao Zhou, Dake Liu, Automatic Port and Bus Sizing in NoGAP, Published: 2010-07-21, Proceedings of SAMOS X
- Per Karlström, Sumathi Loganathan, Faisal Akhlaq, Dake Liu, Automatic Assembler Generator for NoGAP, Published: 2010, PRIME 2010
- R Asghar, D Liu Towards Radix-4, Parallel Interleaver Design to Support High-Throughput Turbo Decoding for Re-Configurability
- D Wu, J Eilert, R Asghar, D Liu, VLSI Imp of A Fixed-Complexity Soft-Output MIMO Detector for High-Speed Wireless, EURASIP J on Wireless Commun and Networking, 2010
- Ingemar Ragnemalm and Dake Liu, Towards using ePUMA architecture for Hand-held video games, CGVCVIP conference July 2010, Freiburg, Germany.
- R Asghar, D Liu, Low complexity multi mode interleaver core for WiMax with support for conv interleaving, Int J of Electr, Commun and Computer eng No 1, pp 20-29, 2009
- R Asghar, and D Liu Towards Radix-4, Parallel Interleaver Design to Support High-Throughput Turbo Decoding for Re-Conf 33rd IEEE SARNOFF, New Jersey, USA, April 2010.
- Di Wu, Johan Eilert, R Asghar, M Ge, D Liu, VLSI Imp of a Multi-Standard MIMO Symbol Detector for 3GPP LTE and WiMAX, 16th IEEE Int Conf on Electronics, C & Sys, 2009
- Di Wu, R Asghar, Y Huang, D Liu, Imp of a High-Speed Parallel Turbo Decoder for 3GPP LTE Terminals, IEEE 8th Int Conference on ASICON, China, October 2009
- Dake Liu, Challenges of Digital Radio Baseband for Multi-Mode Mobile, Invited talk, Embedded Conference Scandinavia, Stockholm, October 2009
- D Wu, J Eilert, D Liu, A Nilsson, E Tell and E Alfredsson, Sys Arch for 3GPP LTE Modem using a Prog Baseband Processor, Proc of Int Sym SoC, Tampere, Finland, October 2009
- J Sohl, J Wang, D Liu, Large Matrix Multiplication on a Novel Heterogeneous Parallel DSP Architecture, 8th Int Sym on APPT, Rapperswil, Switzerland, August 2009
- Di Wu, Johan Eilert, Dake Liu, Evaluation of MIMO Symbol Detectors for 3GPP LTE Terminals, 17th EUSIPCO, Glasgow, Scotland, August 2009
- A Ehliar and D Liu, An Asic Perspective on FPGA Optimizations, 19th Int Conf on Field Programmable Logic and Applications (FPL), Prague, Czech Republic, Sept 2009
- R Asghar, Di Wu, J Eilert, D Liu, Mem Conflict Analysis and Interleaver Design for Parallel Turbo Decoding Supporting HSPA Evolution, 12th EUROMICRO, Greece, August 2009
- R Asghar and D Liu, Low Complexity HW Interleaver for MIMO-OFDM based Wireless LAN, IEEE Int Sym on Circuits and Systems (ISCAS), Taipei, Taiwan, May 2009
- Di Wu, E. Larsson, D Liu, Impl Aspects of Fixed-Complexity Soft-Output MIMO Detection, Proc. of IEEE 69th Vehicular Tech Conf (VTC-Spring), Barcelona, Spain, April 2009
- Mirsad Čirkić, Daniel Persson, Erik G. Larsson, "Optimization of Computational Resource Allocation for Soft MIMO Detection", Proceedings of Asilomar'09, 2009
- Christoph Kessler, Jörg Keller, Optimized Mapping of Pipelined Task Graphs on the Cell BE. Proc. 14th Int. W CPC-2009, Zürich, Switzerland, Jan. 2009.
- M Eriksson, C Kessler. Integrated Modulo Scheduling for Clustered VLIW Architectures. Proc. HiPEAC-2009, Jan. 2009. Springer LNCS 5409, pp. 65-79.
- Jörg Keller, Christoph Kessler, Bert Wesarg. Efficient Simulation of Fork Programs on Multicore Machines. PARS'09: 22nd PARS-Workshop, Dec. 2009.
- C. Kessler. Multicore Möjliga Scenarion för Framtiden [In Swedish] OnTime (No. 3/2009), Combitech AB, Sweden, Dec. 2009.
- M. Bao, A. et al, "On-line Thermal Aware Dyn Vol Scaling for Energy Opt with Freq/Temp Dep Consideration," Proc. IEEE/ACM DAC'09, July 26-31, 2009.

#### Plus one book chapter

- Dake Liu, ASIP for DSP, invited book chapter, to appear in DSP handbook, chapter 2.5, Springer.




### Reviewed papers published in 2010/2011

- 1. Johan Eilert ASIP for Wireless Communication and Media Linköping Studies in Science and Technology, Dissertations, No. 1298, Linköping, Sweden, February 2010
- 2. Rizwan Asghar Flexible Interleaving Sub-systems for FEC in Baseband Processors Linköping Studies in Science and Technology, Dissertations, No. 1312, Linköping, Sweden, May 2010
- 3. Per Karlström, NoGAP, Novel Generator of Accelerators and Processors, Linköping Studies in Science and Technology, Dissertations, No. 1347, Linköping, Sweden, Nov 2010
- 4. Rizwan Asghar, Di Wu, Ali Saeed, Yuling Huang, Dake Liu, Implementation of a Radix-4, Parallel Turbo Decoder and Enabling the Multi-Standard Support Journal of Signal Processing Systems, 2010
- 5. Di Wu, Johan Eilert, Rizwan Asghar, Dake Liu, VLSI Implementation of a Fixed-Complexity Soft-Output MIMO Detector for High-Speed Wireless Journal on Wireless Communications and Networking
- 6. Di Wu, Johan Eilert, Rizwan Asghar, Dake Liu, Anders Nilsson, Eric Tell and Erik Alfredsson System Architecture for 3GPP-LTE Modem using a Programmable Baseband Processor International Journal of Embedded and Real-Time Communication Systems (IJERTCS), 2010
- 7. Rizwan Asghar, Dake Liu, Multimode flex-interleaver core for baseband processor platform Journal of Computer Systems, Networks and Communications, Vol. 2010, doi: 10.1155/2010/793807
- 8. Rizwan Asghar, Di Wu, Johan Eilert, Dake Liu, Memory Conflict Analysis and Implementation of a Re-configurable Interleaver Architecture Supporting Unified Parallel Turbo Decoding Springer, Journal of Signal Processing Systems, DOI: 10.1007/s11265-009-0394-8
- 9. Per Karlström, Sumathi Loganathan, Faisal Akhlaq, Wenbiao Zhou, Dake Liu Automatic Assembler Generator for NoGAP PRIME 10, 6th Conference on Ph.D. Research in Microelectronics and Electronics, July 18-21, 2010, Berlin, Germany.
- 10. Per Karlström, Wenbiao Zhou, Dake Liu Automatic Port and Bus Sizing in Novel Generator of Accelerators and Processors (NoGAP) SAMOS X, July 19-22, 2010, Samos, Greece.
- 11. Jian Wang, Joar Sohl, Olof Kraigher, Dake Liu ePUMA: a novel embedded parallel DSP platform for predictable computing, Int. Conf Information and Electronics Engineering, Shanghai, China, June 2010
- 12. Per Karlström, Wenbiao Zhou, Dake Liu Operation Classification for Control Path Synthetization with NoGAP ITNG 2010, 7th Int Con on Information Tech: New Generations, Las Vegas, USA, April 2010
- 13. Wenbiao Zhou, Per Karlström, Dake Liu A Flexible Common Language for Processor Hardware Description IEEE International Symposium on Design and Diagnostics of Electronic Circuits and Systems, April 2010, Vienna, Austria.
- 14. Rizwan Asghar, Dake Liu Towards Radix-4, Parallel Interleaver Design to Support High-Throughput Turbo Decoding for Re-Configurability 33rd IEEE SARNOFF Sym, Princeton, New Jersey, April 2010
- 15. Erik Hansson, Joar Sohl, Christoph Kessler, Dake Liu: Case Study of Efficient Parallel Memory Access Programming for the Embedded Heterogeneous Multicore DSP Architecture ePUMA. Accepted for Proc. Int. Workshop on Multi-Core Computing Systems (MuCoCoS-2011), June 2011, Seoul, Korea. IEEE Computer Society Press.
- 16. Erik Hansson, Joar Sohl, Christoph Kessler, Dake Liu: Case Study of Efficient Parallel Memory Access Programming for an Embedded Heterogeneous Multicore DSP Architecture. Proc. MCC-2010 Third Swedish Workshop on Multicore Computing, Gothenburg, Sweden, Nov. 2010.
- 17. Christoph Kessler: Programming Techniques for the Cell Processor. it Information Technology, 53(2): 66-75, Special issue on Multicore, April 2011, Oldenbourg-Verlag, ISSN 1611-2776.
- 18. Christoph Kessler: Compiling for VLIW DSPs. Book chapter, 38 pages, in: S. Bhattacharyya, E. Deprettere, R. Leupers, and J. Takala, eds., Handbook on Signal Processing Systems, Springer, Sept. 2010.
- 19. Rikard Hulten, Christoph Kessler, Jörg Keller: Optimized On-Chip Pipelined Mergesort on the Cell/B.E. Proc. EuroPar-2010 conference, Part II, Springer LNCS 6272, pp. 187-198, August 2010.
- 20. Johan Enmyren, Christoph Kessler: SkePU: A Multi-Backend Skeleton Programming Library for Multi-GPU Systems. Proc. 4th Int. Workshop on High-Level Parallel Programming and Applications (HLPP-2010), Baltimore, USA, Sep. 2010. ACM.
- 21. Johan Enmyren, Usman Dastgeer, Christoph Kessler: Towards a Tunable Multi-Backend Skeleton Programming Framework for Multi-GPU Systems. Proc. MCC-2010 Third Swedish Workshop on Multicore Computing, Gothenburg, Sweden, Nov. 2010.
- 22. Usman Dastgeer, Johan Enmyren, Christoph Kessler: Auto-tuning SkePU: A Multi-Backend Skeleton Programming Framework for Multi-GPU Systems. To appear in: Proc. IWMSE-2011, Hawaii, USA, May 2011, ACM press.
- 23. Jens Ogniewski, Ingemar Ragnemalm, Autostereoscopy and Motion Parallax for Mobile Computer Games Using Commercially Available Hardware, Proc., Game and Entertainment Technologies 2010, Freiburg, Germany 2010
- 24. Ingemar Ragnemalm, Minimalism for Usability: The Design of a Programming Development System With a Minimalistic User Interface, Proc., Interfaces and Human Computer Interaction 2010, Freiburg, Germany 2010
- 25. Min Bao, "System-Level Techniques for Temperature-Aware Energy Optimization," Licentiate Thesis No. 1459, Dept. of Computer and Information Science, Linköping University, December 2010.
- 26. Min Bao, A. Andrei, P. Eles, and Z. Peng, "Temperature-Aware Idle Time Distribution for Energy Optimization with Dynamic Voltage Scaling," Design Automation and Test in Europe (DATE 2010), Dresden, Germany, March 8–12, 2010.
- 27. Jens Ogniewski: "Towards using the ePUMA architecture for hand-held video games", presented at Freiburg 2010






# PhDs graduated since 2008

| Year | PhD students        | Supervisor           | Title                                                                        |
|------|---------------------|----------------------|------------------------------------------------------------------------------|
| 2009 | Andreas Ehliar      | Dake Liu             | Performance driven FPGA design with an ASIC perspective                      |
| 2009 | Di Wu               | Dake Liu             | Scalable Multi-Standard Radio Baseband for Modern Wireless Communications    |
| 2010 | Johan Eilert        | Dake Liu             | ASIP for Wireless Communication and Media                                    |
| 2010 | Rizwan Asghar       | Dake Liu             | Flexible Interleaving Sub-systems for FEC in Baseband Processors             |
| 2010 | Per Karlström       | Dake Liu             | NoGAP                                                                        |
| 2011 | Mattias Eriksson    | Christoph Kessler    | Integrated Code Generation                                                   |



### Roadmap of the further research




### Achievement and plan of the project

Work items in the plan:

- System level architecture
- Application and code analysis
- Data access analysis, OCN
- Instruction set architecture
- Architectural simulation
- Micro architecture research
- Kernel library & code template
- Kernel identification
- Task and inner-loop scheduling
- Compiling and code-insertion
- Parallelize sequential programs
- Minimize control overheads
- HW proven and prototyping
- Future handset demonstration
- LTE base station demonstration

Chart legend: already finished; to finish by end of 2013; to finish by June 2016.






The hardware RTL implementation will be done using our NoGAP. A compiler, assembler, and simulator can also be generated as references.




# Coresonic Cooperated with ERICSSON








