Low Power Multimedia Reconfigurable Platforms

Jun-Dong Cho SungKyunKwan Univ. Dept. of ECE, Vada Lab.

Factors in Implementing a High-Performance System
High Speed, High Density, Reduced Swing Logic, Deep Submicron Technology, Low Power per Gate, Low Voltage, Channel Engineering, Low Capacitance, Low VT, Advanced Technology

Power Consumption: Components of Power Consumption in Digital Circuits
Dynamic power accounts for the largest share of power consumption. Given a fixed library, the designer can control four factors: activity, VDD, frequency, and routing capacitance.
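The effect of the designer-controlled factors can be sketched numerically. The snippet below is a minimal illustration, assuming the usual switching-power model (activity x capacitance x VDD squared x frequency); the function name and the operating point are hypothetical and not taken from the slides.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """Switching power in watts: activity * C * VDD^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


# Hypothetical operating point: 10% activity, 1 nF switched, 3.3 V, 50 MHz.
base = dynamic_power(0.1, 1e-9, 3.3, 50e6)
half_freq = dynamic_power(0.1, 1e-9, 3.3, 25e6)   # halve the frequency
half_vdd = dynamic_power(0.1, 1e-9, 1.65, 50e6)   # halve the supply voltage

print(f"base      : {base * 1e3:.2f} mW")
print(f"f / 2     : {half_freq / base:.2f} x base")
print(f"VDD / 2   : {half_vdd / base:.2f} x base")
```

Under this model, activity, capacitance, and frequency each scale power linearly, while VDD scales it quadratically.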

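Design Methods for Reducing Power Consumption: Adjusting the Supply Voltage
Use a high voltage only in the parts of the IC that need high speed. Put unused blocks into sleep mode to reduce power consumption. Lowering the operating frequency: use parallel processing to obtain the same throughput at a lower operating frequency; the resulting increase in area is unavoidable. Avoid large clock buffers. Use a Phase Locked Loop (PLL) to raise the frequency only where it is needed.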

Design Methods for Reducing Power Consumption: Reducing Parasitic Capacitance
Use short wires on critical nodes. Avoid fan-outs of three or more. Reduce the wire width when using a low voltage. Use transistors that are as small as possible. Reducing switching activity: reduce the number of bits. Use static circuits rather than dynamic circuits. Reduce the total transistor count. Make the most active nodes internal nodes.

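Design Methods for Reducing Power Consumption: Reducing Switching Activity
Design the logic so that the sum over all nodes of frequency times capacitance is minimized; that is, so that switching activity is statistically minimal. When deciding the logic tree, place an input signal farther from VDD or ground the higher its activity. Design high-activity cells as dynamic and low-activity cells as static. Turn off the clock of flip-flops whose data does not change. Make it possible to disable the clock of cells that are not always in use.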

ERE Framework. ERE illustrates the performance-energy tradeoffs by concurrently considering the performance improvement, energy savings, and resource efficiency of a system. Here i is the base configuration with 1 resource and j is the new configuration with N resources. ERE is the product of two factors: the fraction of the energy saved, {E(1,i) - E(N,j)} / E(1,i), and the normalized efficiency, S(N,j)/j•N, where the speed-up is S(N,j) = T(1,i)/T(N,j). ERE suggests 4 DSPs, whereas EDP suggests 12 DSPs without considering the efficiency.

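ERE Framework: values, good result, bad result, and what each value has an impact on
Speed-up S(N,j): good result > 1. Efficiency factor: good result 1, bad result 0. Energy-saved fraction (energy improvements): good result > 0, bad result < 0. ERE (performance-energy tradeoff): good result ERE > 0, bad result ERE < 0.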

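NoC (Network on Chip): Porting a Communication Network Structure onto a Single Semiconductor Chip
U.C. Berkeley. Port a communication network structure onto a single semiconductor chip. Define the transmission protocol according to the OSI model. Connect DSP/microprocessor/memory and so on within a single chip using H/W-S/W co-design. Optimize code and build a low-power software IP library. Build a bus structure for connecting modules. Components: Region (an area that permits a special topology or network structure), Backbone, and Wrapper (converts transmitted messages into an appropriate form; complex). Suited to complex, large systems.
This slide describes the Network on Chip design basis, in which the components of a system are assigned addresses and data travels as packets that carry a destination address from source to destination over a communication network. Network on Chip, abbreviated NoC, is a concept that began to be studied at U.C. Berkeley (USA) and elsewhere; it is an attempt to port a communication network structure into the system design of a single semiconductor chip. The ported network structure is defined by a transmission protocol such as the OSI 7-layer model, and the research began in order to support heterogeneous large systems such as DSP/µProcessor/Memory/ASIC that are connected within a single chip and exchange data randomly. The NoC design was made more concrete, and its components were defined, by research at the Royal Institute of Technology in Sweden, VTT Electronics in Finland, and others. The NoC they constructed, named CLICHÉ, consists of transmission lines arranged in a two-dimensional grid, switches at the intersections, and resources (large systems) attached to the switches; its concrete form is described on the next slide. The components they defined are as follows.
[1] Region: in the two-dimensional switch network, an insulated special area that covers a single switch or a range of switches, including the resources attached to those switches. A region may use a special topology or a particular internal communication structure that differs from the rest of the NoC. It is used when a structure such as CLICHÉ is unsuitable for performance reasons. It is not the same concept as a sub-network; it is regarded as a small mechanism for organizing communication efficiently.
[2] Backbone: the backbone is a generic development platform for NoC-based systems. Its role is to provide a solid starting point for ASIC design that offers both design guidelines and flexibility. If the NoC design process is divided into three stages (backbone, platform, and application development), the backbone design stage determines the types of regions and prepares the communication channels and switches, the network interfaces and resources, and the communication protocols. Using these basic elements, the designer can map software or configurable hardware through the backbone.
[3] Wrapper: converts transmitted messages into an appropriate form for communication between heterogeneous resources and for data exchange between different regions; it is complex.
Because the network organization of such an NoC is relatively complex, and because packet transfer makes it efficient for random data traffic, it is suited to building large systems.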

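Switch Network: CLICHÉ. OSI Model Used as the Data Transmission Protocol. Network Integrated on a Chip (Network on Chip)
Packet data transfer. Large systems are the components. Advantageous for chip-level integration of heterogeneous components.
The figure on this slide shows the CLICHÉ structure with 16 resources, from the NoC organization described on the previous slide, together with the structure of an individual switch. It takes the form of a network integrated on a chip, in which the switches act as a kind of router for packet transfer. The transmission protocol is defined by the OSI layer model and, as in the left-hand figure where each resource consists of a microprocessor, DSP, and so on, the components are themselves large systems. In the figure, P is a processor core, D is a DSP core, c is a cache, M is a memory, and re is a reconfigurable block. Between large heterogeneous systems, data transfer demands are not constant but random. This resembles the data transfer demands of the existing Internet or of packet telephone switching networks, so such a network structure can be integrated on a chip to good advantage.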
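How a switch grid might forward a packet toward its destination address can be sketched as follows. This is only an illustration: the slides do not say which routing algorithm CLICHÉ uses, so dimension-ordered (XY) routing, the coordinate scheme, and the function name `xy_route` are all assumptions.

```python
def xy_route(src, dst):
    """Return the switch coordinates a packet visits from src to dst.

    Assumed policy: move along X until the column matches, then along Y.
    Coordinates are (x, y) positions in a 2-D switch grid.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


# A packet crossing a 4x4 grid (16 resources) corner to corner.
route = xy_route((0, 0), (3, 3))
print(len(route) - 1, "hops:", route)
```

Each switch only needs the destination address carried in the packet to pick the next hop, which is the property the slide attributes to router-like switches.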

NoC Figures of Merit: Functionality, Capacity, Performance, System
Scalability Efficiency Utilisation Computation Energy consumption Fault tolerance Result quality (accuracy) Responsiveness Storage Communication Capacity Functionality Performance Materials Licencing Production Structural Functional Control System Quality Implementation Complexity Variability Cost Development Flexibility Volume Effort Time Risk Modifiability Configurability Applicability Modularity Cohesion Coupling Lifetime Usability Programmability Manufacturability

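Low-Power Issues in NoC, by Layer
Application layer: DPM, resource management, power management API. Transport layer: data packet management for guaranteeing QoS (minimum delay and message loss); PSM through messages. Network layer: switching and routing issues in packetized data transfer. Data link layer: reducing and recovering from packet data errors and losses. Physical layer: reliability issues under DVS; on-chip synchronization issues.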

NoC-Based Application Areas: Low Power communication systems
High-performance communication systems Baseband platform High-capacity communication systems Personal assistant Database platform Data collection systems BACKBONE Multimedia platform Entertainment devices PLATFORMS Virtual reality games SYSTEMS

NoC Design Flow R. Marculescu

Structural layers of NOC
Product: system control, product behaviour
Configuration: network management, allocation, operation modes
Applications: resource management, diagnostics, applications
Functions: execution control, functions
Executables: RTOS, code, HW configurations
Hardware units: processors, memories, configurable HW, logic
Resources: resource types, buses, IO
Regions: region types, switches, network interfaces
Communication: channels and protocols

Network protocol Application System/Session Transport Network
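Physical: signal voltage, timing, bus width, signal synchronization. Data link: error detection and correction; arbitration of the physical medium. Network: IP protocol; data routing. Transport: TCP protocol; end-to-end connection.
The functions of each layer of the OSI layer model, which is proposed as the communication protocol in NoC design, and the functions performed at each layer in an NoC are as follows.
Physical layer: determines the length and number of the wires that connect resources and switches. In the communication network model it sets the signal voltage, timing, bus width, and so on.
Data-link layer: defines the protocol for transfers between a switch and a resource and between switches. In the communication network model it performs error detection and correction and arbitration of the physical medium.
Network layer: in the conventional network model it defined the IP protocol and controlled data routing. In an NoC it defines how a packet is delivered through the network from an arbitrary sender to a receiver, given the receiver's address.
Transport layer: this was the layer that supported the TCP protocol for end-to-end connection and reliable delivery. In an NoC it adapts the message size, which includes packing a whole message into network-layer packets.
The functions of these four layers are implemented in the RNI that is connected to the switch in the CLICHÉ structure shown on the earlier slide.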

NOC Platform development
Scaling problem How big NOC is needed? What are the application area requirements? Region definition problem What kind of regions are needed? What kind of interfaces between regions? What are the capacity requirements for the regions? Resource design problem What is needed inside resources? Internal computation type and internal communication? Application mapping flow problem What kind of languages, models and tools must be supported? How to validate and test the final products?

NOC Application Development
Mapping problem How to partition applications for NOC resources? How to allocate functionality effectively? Is the performance adequate? Is the resource usage in balance? Optimisation problem How to perform global optimisation of heterogeneous applications? How to define the right optimisation targets? How to utilise application/resource type specific tools? Validation problem Are the constraints met? Are there communication bottlenecks or power consumption hot spots? How to simulate a GIPS system? How to test all applications?

Network on Chip alternatives
NOC = Network of computation and storage resources NOC parameters: Number of resources Types of resources GPU DSP Memory Configurable HW Coprocessors Any combination Communication capability

Layered Radio Architecture
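Characteristics of the processing layer: linear connection. Modules combine fixed elements and reconfigurable elements. Each module operates independently without affecting other modules. The overall flow is organized as a pipeline.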

Stallion device from Virginia Tech
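Switch network (Srikanteswara): Smart Crossbar, IFU mesh. Stallion processor: crossbar (similar to circuit switching), packet data transfer, layered transfer structure.
From here on, examples are given for the case where, as in a processor, program threads are processed in operating cycles, and for the case of random data transfer networks (e.g. CLICHE) among heterogeneous large-scale processors such as µProcessor/ASIC/DSP. The Stallion processor, developed at Virginia Tech, is reconfigurable at run time. A stream is the combination of data and a programming header; as the stream moves through the processor, the programming header locally reconfigures and designates the functional units for the data within the IFU meshes attached above and below.
[1] Crossbar: similar to circuit switching. However, because the transmitted data takes the form of packets, it acts like a variable network through which fixed-length packets can pass steadily in order to process a given program thread.
[2] Packet data transfer: data is transferred in fixed-length packet units called streams.
[3] Layered transfer structure: the process by which data is passed between processing elements inside the processor, and by which the order of passage through a series of processing elements is controlled, takes the form of a layered communication model. That is, the operational model of the computation (instruction) to be performed, expressed as data transfer and a sequence of passages through processing elements, is presented as a layered model. This layered model resembles a communication protocol: the flow of packets through a series of processing elements is treated as a stream, the passage of the stream is described by a layered model, and each processing element is internally organized to carry out the steps that the layered model requires. Data that has passed through the crossbar is delivered to the next processing element, and a packet whose processing is complete is output through the crossbar. The operation is described next in terms of the layered model.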

Advantages in the Layered Architecture
Defines the methodology for designing multimode radios using hardware paging. Provides the framework for building a flexible soft radio, at the expense of the overhead of packetizing data. Excellent hardware reusability: libraries of hardware functions can be built much like software libraries. Good data flow properties and a simple interface between the processing layer modules.

Application Layer Software
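Stream-based design. [Figure: stream packets pass through processing element 1 and processing element 2 under the control of application layer software; the layers shown are the I/O layer, configuration layer, and processing layer. Inside a processing element: interpret packet, processing pipeline, bypass pipeline, reconstruct packet.]
In the Stallion processor, passage through the processing elements amounts to pipelined transfer in packet units. This pipelining does not come from the crossbar switches but from the cycle delays incurred in passing through each processing element. The way a stream passes through the processing elements can therefore be changed by modifying the packet information, which makes it easy to change what is processed. In other words, because processing proceeds in time steps according to the operating cycle, the required operations can be composed sequentially and that composition is easy to change. The whole stream is controlled by application layer software and operates through the following three layers.
[1] I/O layer: defines the connection to the processor.
[2] Configuration layer: stores the list of processing elements used by the processing layer and changes the composition of the instructions to be executed, for example by specifying the addresses of the next processing elements.
[3] Processing layer: the layer where the actual operations are performed. A processing element internally interprets the incoming packet and rewrites the packet so that the processed packet can be sent to the next address. Processing consists of three parts: routing based on the destination information specified by the configuration layer, a pipeline for the actual computation, and a bypass pipeline for skipping paths that do no processing.
This processor structure has the drawback that each processing element handles routing and so becomes complex, and the pipelined organization also adds computation latency.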

Bus-Based vs. P2P Communication
R. Marculescu Buses Interconnections become dominant in DSM Huge bandwidth requirements (tens of Gb/s for some applications) (buses are not scalable!) Expanding market of mobile and other low-power applications Increasing cooling costs (buses consume too much power!) P2P Communication Faster; no bus contention, no bus arbitration Low-power solution Can be independently optimized May need more wiring resources

System Inputs R. Marculescu A set of IPs:
Hard IP (Width*length, provided by different IP providers) Soft IP (Size provided by synthesis or estimation) Communication Task Graph (CTG)

Target Platform R. Marculescu

MPEG-2 Video Encoder R. Marculescu

Energy Comparison R. Marculescu

Packet-Based On-Chip Communication: Regular Architecture
R. Marculescu

Energy-Aware Mapping for Tile-based Architectures
R. Marculescu Objective: minimize the total communication energy consumption Constraint: meet the communication performance constraints (specified by designer) For a 4X4 tile architecture, 16! mappings
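The size of the mapping search space, and what an exhaustive energy-aware mapping looks like on a tiny instance, can be sketched as below. The energy model (traffic volume times Manhattan hop count), the task graph, and all names are illustrative assumptions, not the formulation used in the cited work.

```python
from itertools import permutations
from math import factorial


def comm_energy(mapping, traffic, width):
    """Sum of volume * Manhattan hop count; mapping[ip] is a tile index."""
    total = 0
    for (a, b), volume in traffic.items():
        ar, ac = divmod(mapping[a], width)
        br, bc = divmod(mapping[b], width)
        total += volume * (abs(ar - br) + abs(ac - bc))
    return total


def best_mapping(n_ips, traffic, width):
    """Exhaustively try every placement of n_ips IPs on n_ips tiles."""
    return min(permutations(range(n_ips)),
               key=lambda m: comm_energy(m, traffic, width))


# Hypothetical 4-IP communication task graph on a 2x2 tile array.
traffic = {(0, 1): 10, (1, 2): 10, (2, 3): 10, (0, 3): 1}
best = best_mapping(4, traffic, width=2)
print("best mapping:", best, "energy:", comm_energy(best, traffic, 2))
print("4x4 search space:", factorial(16), "mappings")
```

Exhaustive search is only feasible for toy sizes; at 16! candidate mappings a 4x4 array already calls for a heuristic or branch-and-bound search.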

Tile-based Architecture Platform
R. Marculescu

Network-centric Power Management
R. Marculescu Network-centric Power Management Ability to make better predictions about the future workloads Network power management adds very few overhead packets to the overall communication stream between cores Amount of energy wasted while the core is idle is reduced, as the local PM knows ahead of time that no requests are arriving in near future

NoC protocols must be tolerant to common faults
R. Marculescu NoC protocols must be tolerant to common faults Data upsets: Crosstalk, EMI Buffer overflows Node/link failures Synchronization errors

Wires-Centric Design Exploits logic structure to reduce wire loads
Enables use of advanced circuits wire properties and crosstalk known early and well characterized Gives a stable design key wire loads don’t change with small logic changes

Wires dominate - power, area, delay
Problem - Contemporary tools leave wires as an afterthought result is lack of structure, visibility, and control Solution 1 - wires first design route key wires, then place gates Solution 2 - route packets, not wires on-chip networks global wires fixed before the design starts

Wires-first design

On-Chip Interconnection Networks
Replace dedicated global wiring with a shared network Dedicated wiring Network

Most Wires are Idle Most of the Time
Don’t dedicate wires to signals, share wires across multiple signals Route packets not wires Organize global wiring as an on-chip interconnection network allows the wiring resource to be shared keeping wires busy most of the time allows a single global interconnect to be re-used on multiple designs makes global wiring regular and highly optimized

Dedicated wires vs. Network

Power consumption of CMOS circuits
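$$P = \alpha \cdot C_L \cdot f \cdot V_{dd}^2 + \alpha \cdot I_{SC} \cdot t_{sc} \cdot f \cdot V_{dd} + I_{DC} \cdot V_{dd} + I_{LEAK} \cdot V_{dd}$$
The four terms are, in order: charging and discharging, crowbar current, static current, and subthreshold leakage current.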

Vdd, power, and current trend
[Figure: supply voltage (VDD), power per chip [W], and VDD current [A] plotted against year. Source: International Technology Roadmap for Semiconductors, 1998 update.]

New Computing Platforms
SOC power efficiency of more than 10 GOPS/W. Higher on-chip system integration: COTS: 100 W, SOAC: 10 W (inter-chip capacitive loads, I/O buffers). Speed and performance: shorter interconnections, fewer drivers, faster devices, more efficient processing architectures. Mixed-signal systems. Reuse of IP blocks. Multiprocessor, configurable computing. Domain-specific, combined memory-logic.

Power-distribution in integrated PicoRadio (total: 100 mW)
Jan M. Rabaey

Web browsing is slow with 802.11 PSM
Son! Haven’t I told you to turn on power-saving mode. Batteries don’t grow on trees you know! But dad! Performance SUCKS when I turn on power-saving mode! So what! When I was your age, I walked 2 miles through the snow to fetch my Web pages! Users complain about performance degradation

LOW Power Methods

Levels for Low Power Design
System: hardware-software partitioning, power down. Algorithm: complexity, concurrency, locality, regularity, data representation. Architecture: parallelism, pipelining, signal correlations, instruction set selection, data rep. Circuit/Logic: sizing, logic style, logic design. Technology: threshold reduction, scaling, advanced packaging, SOI. [Table: expected saving by level of abstraction (algorithm, architecture, logic level, layout level, device level); only the units "times" and "%" survive.]

System Level Power Optimization
Algorithm selection / algorithm transformation Identification of hot spots Low Power data encoding Quality of Service vs. Power Low Power Memory mapping Resource Sharing / Allocation

Flow C/C++ Compilation Program Execution
Building design representation Loading profiling data Setting constraints Power estimation Identification of Hot Spots

Power-hungry Applications
Signal Compression: HDTV Standard, ADPCM, Vector Quantization, H.263, 2-D motion estimation, MPEG-2 storage management Digital Communications: Shaping Filters, Equalizers, Viterbi decoders, Reed-Solomon decoders

Clock Network Power Managements
50% of the total power. FIR (massively pipelined circuit): video processing (edge detection), voice processing (data transmission like xDSL). Telephony: 50% (70%/30%) idle, because the two parties do not talk at the same time. With every clock cycle, data are loaded into the working register banks, even if there are no data changes.

Low Power Design Flow I

Low Power Design Flow II

Why Algorithm-Architecture Co-design?
Total Power Consumption = transmission power + radio power consumption ACM PSM DVS GC LCC

Three Factors affecting Energy
Reducing waste by Hardware Simplification: redundant h/w extraction, Locality of reference,Demand-driven / Data-driven computation,Application-specific processing,Preservation of data correlations, Distributed processing All in one Approach(SOC): I/O pin and buffer reduction Voltage Reducible Hardwares 2-D pipelining (systolic arrays) SIMD:Parallel Processing:useful for data w/ parallel structure VLIW: Approach- flexible

IBM’s PowerPC Optimum Supply Voltage through Hardware Parallel, Pipelining ,Parallel instruction execution five instruction in parallel (IU, FPU, BPU, LSU, SRU) , RISC FPU is pipelined so a multiply-add instruction can be issued every clock cycle Low power 3.3-volt design 603e provides four software controllable power-saving modes. Copper Processor with SOI IBM’s Blue Logic ASIC :New design reduces of power by a factor of 10 times

Example 1: Filter: Eliminating Redundant Computations

Example2: IBM’s PowerPC Lower Power Architecture
Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution. The 603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU). The FPU is pipelined, so a multiply-add instruction can be issued every clock cycle. Low-power 3.3-volt design. Use small complex instructions with a smaller instruction length. IBM's PowerPC 603e is RISC. Superscalar: CPI < 1; the 603e issues as many as three instructions per cycle. Low-power management: the 603e provides four software-controllable power-saving modes. Copper processor with SOI. IBM's Blue Logic ASIC: the new design reduces power by a factor of 10.

Power-Down Techniques
Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work

Voltage vs Delay. Use variable voltage scaling or scheduling for real-time processing. Use architecture optimization to compensate for slower operation, e.g., parallel processing and pipelining to increase concurrency and shorten the critical path.

Low Voltage Main Memories

Why Copper Processor? Motivation: Aluminum resists the flow of electricity as wires are made thinner and narrower. Performance: 40% speed-up Cost: 30% less expensive Power: Less power from batteries Chip Size: 60% smaller than Aluminum chip

Silicon-on-Insulator
How does SOI reduce capacitance? SOI eliminates junction capacitance: an insulator (similar to glass) is placed between the impurities and the silicon substrate. The result is high performance, low power, and a low soft-error rate.

Partitioning Performance Requirements
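Some functions are easier to implement in hardware: blocks that are used repeatedly and blocks that are organized in parallel. Modifiability: blocks implemented in software are easy to modify. Implementation cost: blocks implemented in hardware can be shared. Scheduling: schedule the blocks that have been separated into HW and SW so that they meet the given constraints. SW operations must be scheduled sequentially. As long as there are no data or control dependencies, SW and HW can be scheduled concurrently.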

Low power partitioning approach
Different HW resources are invoked according to the instruction executed at a specific point in time. During the execution of an add operation, the ALU and registers are used, but the multiplier is idle. Non-active resources still consume energy, since the corresponding circuits continue to switch. Calculate the wasted energy. Add application-specific cores and run them partially: whenever one core is active, all the other cores are shut down.

Low power partitioning
- Derives a graph G - operation and connection - Decomposition of G into a set of clusters - cluster : set of operation - Calculate bus-traffic energy - Pre-select clusters with constraints - Set the number of resources - List scheduling - Test the utilization rate (ASIC or µP) - the utilization rate of µP is supported by SW estimation tool

Design Flow. [Figure: flow from application to core energy estimation, division of the application into clusters, list scheduling, computation of the utilization rate (µP) and the utilization rate (ASIC), evaluation, cluster selection, and HW synthesis.] Results: up to 94% energy saving, and in most cases even reduced execution time; 16k cell overhead.

Low Power CDMA Searcher Project
Project title: low-power design of an IS-95 based DS/CDMA system using co-design techniques. Development period: :28 (12 months). Purpose and method: RTL-level low-power design and implementation of the searcher engine of an MSM (Mobile Station Modem) chip for use in CDMA handsets. Operating frequency: 12.5 MHz. Low-power design using a data flow graph for rescheduling, pre-computation, and strength reduction, together with a synchronous accumulator, reducing area and power by up to 67.68% and 41.35%, respectively. H/W and S/W co-design techniques applied. San Kim and Jun-Dong Cho, “Low Power CDMA Searcher”, CAD and VLSI Workshop, May Inki Hwang, San Kim and Jun-Dong Cho, “CDMA Searcher Co-Design”, ASIC Workshop, Sep

Partitioning Example: CDMA Searcher- vada Lab. SKKU

CDMA Searcher - Software oriented design - Dark block : Hardware
- Interface : Control signal gen. - Partitioned in terms of speed cost - Change from SW to HW 1. Implementation speed 2. Parallel architecture

Result -vada Lab. SKKU

VLSI Signal Processing Design Methodology
pipelining, parallel processing, retiming, folding, unfolding, look-ahead, relaxed look-ahead, and approximate filtering bit-serial, bit-parallel and digit-serial architectures, carry save architecture redundant and residue systems Viterbi decoder, motion compensation, 2D-filtering, and data transmission systems

Low Power DSP DO-LOOP Dominant VSELP Vocoder : 83.4 %
2D 8x8 DCT : 98.3 % LPC computation : 98.0 % DO-LOOP Power Minimization ==> DSP Power Minimization VSELP : Vector Sum Excited Linear Prediction LPC : Linear Prediction Coding

Loop unrolling The technique of loop unrolling replicates the body of a loop some number of times (unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases the instruction parallelism and improves register, data cache or TLB locality. Loop overhead is cut in half because two iterations are performed in each iteration. If array elements are assigned to registers, register locality is improved because A(i) and A(i +1) are used twice in the loop body. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.
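The transformation can be sketched with an unrolling factor of 2. The loop body used here (each element updated from its two neighbours) is an assumed example chosen so that A(i) and A(i+1) each appear twice in the unrolled body; the slide does not show its own loop.

```python
def rolled(a):
    """Original loop: one element updated per iteration."""
    a = list(a)
    for i in range(1, len(a) - 1):
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a


def unrolled_by_2(a):
    """Same computation with the body replicated twice and a step of 2."""
    a = list(a)
    n = len(a)
    i = 1
    while i + 1 < n - 1:              # two original iterations per pass
        a[i] = a[i] + a[i - 1] * a[i + 1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]
        i += 2
    if i < n - 1:                     # leftover iteration for an odd trip count
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a


data = [1, 2, 3, 4, 5, 6, 7]
print(rolled(data) == unrolled_by_2(data))
```

The loop test and index update now run once per two element updates, and the epilogue handles trip counts that are not a multiple of the unrolling factor.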

Loop Unrolling (IIR filter example)
loop unrolling : localize the data to reduce the activity of the inputs of the functional units or two output samples are computed in parallel based on two input samples. Neither the capacitance switched nor the voltage is altered. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation, The transformation yields critical path of 3, thus voltage can be dropped.
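A small numerical check of the unrolled recurrence may help. It assumes a first-order IIR filter y[n] = x[n] + a*y[n-1], which is a common textbook form and not necessarily the exact filter on the slide; function names are illustrative.

```python
def iir_direct(x, a, y_init=0):
    """y[n] = x[n] + a * y[n-1], one output sample per iteration."""
    y, prev = [], y_init
    for sample in x:
        prev = sample + a * prev
        y.append(prev)
    return y


def iir_unrolled(x, a, y_init=0):
    """Two output samples per iteration; len(x) must be even."""
    a2 = a * a                  # constant propagation: a*a is computed once
    y, prev = [], y_init        # prev holds the last output of the previous pair
    for n in range(0, len(x), 2):
        y_first = x[n] + a * prev
        # Distributivity: x[n+1] + a*(x[n] + a*prev) expanded so that it no
        # longer depends on y_first and can be evaluated in parallel with it.
        y_second = x[n + 1] + a * x[n] + a2 * prev
        y.extend([y_first, y_second])
        prev = y_second
    return y


print(iir_direct([1, 2, 3, 4], 3), iir_unrolled([1, 2, 3, 4], 3))
```

Both outputs of a pair depend only on inputs and on the previous pair's result, which is what lets the two samples be computed in parallel and leaves slack that can be traded for a lower voltage.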

Loop Unrolling for Low Power

Effective Resource Utilization

Encoding: Bus-invert (BI) code. Appropriate for random data patterns. Redundant code (1 extra bus line). Reduces average transitions by up to 25%. [Figure: bus-invert encoder and decoder; labels D, X, Z, inv, and majority voter.] R. J. Fletcher, “Integrated circuit having outputs configured for reduced state changes,” May 1987, U.S. Patent M. R. Stan and W. P. Burleson, “Bus-invert coding for low-power I/O,” IEEE Tr. on VLSI Systems, Mar. 1995, pp
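A minimal sketch of bus-invert coding follows, assuming an 8-bit bus that starts at all zeros and a majority threshold of more than half the lines; the function names are illustrative.

```python
def bus_invert_encode(words, width=8):
    """Return (bus_value, invert_bit) pairs for a sequence of data words."""
    mask = (1 << width) - 1
    prev = 0                          # bus lines assumed to start at all zeros
    encoded = []
    for word in words:
        toggles = bin((prev ^ word) & mask).count("1")
        if toggles > width // 2:      # majority of lines would flip
            bus, inv = (~word) & mask, 1
        else:
            bus, inv = word & mask, 0
        encoded.append((bus, inv))
        prev = bus
    return encoded


def bus_invert_decode(encoded, width=8):
    """Recover the data words from (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    return [(~bus & mask) if inv else bus for bus, inv in encoded]


words = [0x00, 0xFF, 0x00, 0xFF]
coded = bus_invert_encode(words)
print(coded, bus_invert_decode(coded) == words)
```

In this worst-case pattern the eight data lines never toggle; only the extra invert line does, which is the redundancy the slide refers to.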

Domain Specific Processor: Flexibility vs. Energy-Efficiency
Arthur Abnous and Jan Rabaey. Programmability requires generalized computation, storage, and communication systems that can be used to implement different kinds of algorithms. Domain-specific processors trade off some of the flexibility of a general-purpose programmable device to achieve higher levels of energy efficiency, while maintaining the flexibility to handle a variety of algorithms. There is a trade-off between efficiency and flexibility: programmable designs incur significant performance and power penalties compared to ASICs. For parallel signal processing algorithms, significant power savings can be achieved by executing the dominant computational kernels of a given class of applications with common features on dedicated, optimized processing elements with minimum energy overhead.

Hybrid Architecture Template (Pleiades) Arthur Abnous and Jan Rabaey
Pleiades does much better on the energy scale than the TI DSPs, because DSPs are general-purpose and their instruction execution involves a great deal of overhead. Pleiades can create dedicated hardware structures tuned to the task at hand, and it executes operations with a small energy overhead.

Application Domains : ULTRA-LOW-POWER DOMAIN-SPECIFIC MULTIMEDIA PROCESSORS
CELP- Based Speech Coding LPC Analysis and Synthesis Codebook Search Lag Computation DCT- Based Video Compression and Decompression DCT and Inverse- DCT Motion Estimation and Compensation Huffman Coding and Decoding Baseband Processing for Digital Radios Demodulation, Channel Equalization Timing Recovery, Error Correction

The Re-configurable Terminal

Satellite Processors

Elements of Energy- Efficiency

Multi-Processor Implementation

Communication Network

Distributed Data- Driven Control
Execution of a hardware module is triggered by the arrival of tokens. When there are no tokens to be processed at a given module, no switching activity occurs in that module.

Design Methodology

Switching Activity Reduction
(a) Average activity in a multiplier as a function of the constant value (b) A parallel and serial implementations of an adder tree.

VSELP Synthesis Filter Mapped onto Satellite Processors

Mappings of VSELP Kernel
The most energy efficient CELP-based speech algorithm - dissipates 36 mW ( Vdd = 1.8V, 0.5 um CMOS) - requires 23.4 MOPS Proposed VSELP speech coder um CMOS - dissipates under 5 mW

FFT Mapping

FFT Comparison

Reconfiguration for Power Saving in Real-Time Motion Estimation
S. R. Park, UMASS

Motion Estimation

Block Matching Algorithm

Configurable H/W Paradigms

Why Hardware for Motion Estimation?
Most Computationally demanding part of Video Encoding Example: CCIR 601 format 720 by 576 pixel 16 by 16 macro block (n = 16) 32 by 32 search area (p = 8) 25 Hz Frame rate (f frame = 25) 9 Giga Operations/Sec is needed for Full Search Block Matching Algorithm.
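The 9 GOPS figure can be reproduced with a short calculation. The sketch assumes (2p + 1)^2 candidate positions per macroblock and three operations per pixel comparison (subtract, absolute value, accumulate); both are assumptions that happen to land on the slide's number.

```python
def full_search_ops_per_second(width, height, n, p, frame_rate, ops_per_pixel=3):
    """Operation rate of full-search block matching.

    n: macroblock size, p: maximum displacement in each direction.
    """
    blocks_per_frame = (width // n) * (height // n)
    candidates = (2 * p + 1) ** 2
    return blocks_per_frame * candidates * n * n * ops_per_pixel * frame_rate


ops = full_search_ops_per_second(720, 576, n=16, p=8, frame_rate=25)
print(f"{ops / 1e9:.2f} GOPS")
```

Because the candidate count grows with the square of p, shrinking the search range when motion is small cuts the workload quadratically.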

Why Reconfiguration in Motion Estimation?
Motion Vector Distributions Adjusting the search area at frame-rate according to the changing characteristics of video sequences Reducing Power Consumption by avoiding unnecessary computation

Architecture for Motion Estimation
From P. Pirsch et al, VLSI Architectures for Video Compression, Proc. Of IEEE, 1995

Re-configurable Architecture for ME

Power Estimation in Reconfigurable Architecture

Power vs Search area

Resource Reuse in FPGAs

Motion Estimation - Conventional

Motion Estimation - Data Reuse

Vector Quantization Lossy compression technique which exploits the correlation that exists between neighboring samples and quantizes samples together

Complexity of VQ Encoding
The distortion metric between an input vector X and a codebook vector C_i is computed as follows: Three VQ encoding algorithms will be evaluated: full search, tree search and differential codebook tree-search.

Full Search Brute-force VQ: the distortion between the input vector and every entry in the code-book is computed, and the codeindex that corresponds to the minimum distortion is determined and sent over to the decoder. For each distortion computation, there are 16 8-bit memory accesses (to fetch the entries in the codeword), 16 subtractions, 16 multiplications, 15 additions. In addition, the minimum of 256 distortion values, which involves 255 comparison operations, must be determined.
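The operation counts above can be checked with a small sketch. The squared-error distortion is an assumption inferred from the stated 16 subtractions, 16 multiplications, and 15 additions per codeword; names and the toy codebook are illustrative.

```python
def distortion(x, c):
    """Squared-error distortion between input vector x and codeword c."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def full_search(x, codebook):
    """Index of the codeword with minimum distortion."""
    return min(range(len(codebook)), key=lambda i: distortion(x, codebook[i]))


def full_search_op_counts(codebook_size=256, dim=16):
    """Operation counts for one input vector under the assumed metric."""
    return {
        "mem_access": codebook_size * dim,
        "sub": codebook_size * dim,
        "mul": codebook_size * dim,
        "add": codebook_size * (dim - 1),
        "compare": codebook_size - 1,
    }


print(full_search([1, 1], [[0, 0], [1, 2], [5, 5]]))
print(full_search_op_counts())
```

For a 256-entry codebook of 16-element vectors this gives 4096 memory accesses, subtractions, and multiplications, 3840 additions, and 255 comparisons.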

Tree-structured Vector Quantization
If for example at level 1, the input vector is closer to the left entry, then the right portion of the tree is never compared below level 2 and an index bit 0 is transmitted. Here only 2 x log2 256 = 16 distortion calculations with 8 comparisons are needed.

Algorithmic Optimization
Minimizing the number of operations. Example: a video data stream coded with the vector quantization (VQ) algorithm and its distortion metric.
Full-search VQ: exhaustive full search; distortion calculations: 256; value comparisons: 255.
Tree-structured VQ: binary tree search with some performance degradation; distortion calculations: 16 (2 x log2 256); value comparisons: 8.
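A binary tree search can be sketched as follows. The tree representation (nested pairs of codeword and subtree, with integer leaves), the squared-error distortion, and the toy codebook are assumptions made for illustration.

```python
from math import log2


def distortion(x, c):
    """Squared-error distortion (assumed metric)."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def tree_search(x, node):
    """Descend a binary codebook tree.

    An internal node is ((left_codeword, left_subtree),
                         (right_codeword, right_subtree));
    a leaf is an integer code index. Returns (index, transmitted bits).
    """
    bits = []
    while not isinstance(node, int):
        (c_left, left), (c_right, right) = node
        go_right = distortion(x, c_right) < distortion(x, c_left)
        bits.append(int(go_right))
        node = right if go_right else left
    return node, bits


def tree_search_cost(codebook_size):
    """(distortion calculations, comparisons) for a balanced binary tree."""
    levels = int(log2(codebook_size))
    return 2 * levels, levels


tree = (
    ([0, 0], (([0, 0], 0), ([0, 4], 1))),
    ([8, 8], (([8, 4], 2), ([8, 8], 3))),
)
print(tree_search([1, 3], tree), tree_search_cost(256))
```

Only two distortions are evaluated per level, so the cost grows with the logarithm of the codebook size rather than linearly, at the price of possibly missing the globally nearest codeword.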

Differential Codebook Tree-structure Vector Quantization
The distortion difference b/w the left and right node needs to be computed. This equation can be manipulated to reduce the number of operations .

Algorithmic Optimization
Differential codebook tree-structured VQ: modify the equation to optimize the number of operations.
algorithm | # of mem. access | # of mul. | # of add. | # of sub.
full search | 4096 | 4096 | 3840 | 4096
tree search | 256 | 256 | 240 | 264
differential tree search | 136 | 128 | 128 |

Multiplication and Accumulation: MAC
Major operation in DSP [ Modified Booth Encoding ] One of 0, X, -X, 2X, -2X based on each 2 bits of Y X X Y Y MULT ALU ACC PR CSA CPA MUL > (5 * ALU) PR

Operand Swapping (1/2) Y= 00111100 00X000X0 Weight = 2 7FFF AAAA 0001
Weight = how many additions are needed ? By Booth Encoding 00X000X0 Y= Weight = 2 7FFF AAAA 0001 6666 A B A*B B*A 22.0 31.6 28.8 10.0 12.2 Saving 54% 68% 58% Current (mW) Operands Low Weight High Switching
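The weight of an operand and the swapping decision can be sketched as below. The radix-4 (modified Booth) recoding and the rule "use the lower-weight operand as the recoded multiplier" are assumptions about how the slide's weight is meant to be applied; function names are illustrative.

```python
def booth_radix4_digits(y, bits=8):
    """Modified Booth digits (each in -2..2), least significant first."""
    y &= (1 << bits) - 1
    digits, prev = [], 0              # prev is the bit to the right of the pair
    for k in range(0, bits, 2):
        b0 = (y >> k) & 1
        b1 = (y >> (k + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def weight(y, bits=8):
    """Number of nonzero Booth digits, i.e. partial products to add."""
    return sum(1 for d in booth_radix4_digits(y, bits) if d != 0)


def order_operands(a, b, bits=16):
    """Return (multiplicand, multiplier) with the lower-weight value recoded."""
    return (a, b) if weight(b, bits) <= weight(a, bits) else (b, a)


print(booth_radix4_digits(0b00111100), weight(0b00111100))
print([hex(v) for v in order_operands(0x0001, 0xAAAA)])
```

For Y = 00111100 the recoding has two nonzero digits, matching the weight of 2 on the slide; swapping operands does not change the product, only how many partial products the multiplier array has to process.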

Parallel Processing (1)
Designing a Parallel FIR System To obtain a parallel processing structure, the SISO(single-input single-output) system must be converted into a MIMO(multiple-input multiple-output) system. y(3k) = ax(3k)+bx(3k-1)+cx(3k-2) y(3k+1) = ax(3k+1)+bx(3k)+cx(3k-1) y(3k+2) = ax(3k+2)+bx(3k+1)+cx(3k) Parallel Processing systems are also referred to as block processing systems.
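The three block equations can be checked against the serial filter with a short sketch. Samples before the start of the input are assumed to be zero, and the function names are illustrative.

```python
def _sample(x, n):
    """x(n), with x(n) = 0 assumed for n < 0."""
    return x[n] if n >= 0 else 0


def fir_serial(x, a, b, c):
    """SISO form: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    return [a * _sample(x, n) + b * _sample(x, n - 1) + c * _sample(x, n - 2)
            for n in range(len(x))]


def fir_block3(x, a, b, c):
    """MIMO form: three outputs per iteration k; len(x) must be a multiple of 3."""
    y = []
    for k in range(len(x) // 3):
        n = 3 * k
        y.append(a * _sample(x, n) + b * _sample(x, n - 1) + c * _sample(x, n - 2))   # y(3k)
        y.append(a * _sample(x, n + 1) + b * _sample(x, n) + c * _sample(x, n - 1))   # y(3k+1)
        y.append(a * _sample(x, n + 2) + b * _sample(x, n + 1) + c * _sample(x, n))   # y(3k+2)
    return y


x = [1, 2, 3, 4, 5, 6]
print(fir_serial(x, 1, 2, 3), fir_block3(x, 1, 2, 3))
```

The three outputs of a block have no dependency on each other, so in hardware they can be produced by three parallel datapaths clocked at one third of the sample rate.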

Parallel processing architecture for a 3-tap FIR filter (with block size 3)

DIGLOG multiplier 1st Iter 2nd Iter 3rd Iter
Worst-case error % % % Prob. of Error<1% 10% % % With an 8 by 8 multiplier, the exact result can be obtained at a maximum of seven iteration steps (worst case)

Voltage Scaling Merely changing a processor clock frequency is not an effective technique for reducing energy consumption. Reducing the clock frequency will reduce the power consumed by a processor, however, it does not reduce the energy required to perform a given task. Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work.

OS: Voltage Scaling

Different Voltage Schedules
[Figure: three voltage schedules for 1000 Mcycles of work under a timing constraint, plotted as Vdd² against time (0 to 25 s); energy consumption ∝ Vdd². (A) 1000 Mcycles at 50 MHz and 5.0 V: 40 J. (B) 750 Mcycles at 50 MHz and 5.0 V, then 250 Mcycles at 25 MHz and 2.5 V: 32.5 J. (C) 1000 Mcycles at 40 MHz and 4.0 V: 25 J.]
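The energies of the three schedules can be reproduced by assuming that energy per cycle is proportional to Vdd squared and calibrating the constant so that schedule (A) costs 40 J. The cycle counts, frequencies, and voltages below are read from the figure labels; the function name is illustrative.

```python
def schedule_energy(segments, joules_per_mcycle_v2):
    """segments: list of (mcycles, freq_mhz, vdd). Returns (time_s, energy_j)."""
    time_s = sum(mc / f for mc, f, _ in segments)        # Mcycles / MHz = seconds
    energy = sum(mc * v * v for mc, _, v in segments) * joules_per_mcycle_v2
    return time_s, energy


K = 40.0 / (1000 * 5.0 ** 2)          # calibrated so that schedule (A) costs 40 J

schedules = {
    "A": [(1000, 50, 5.0)],
    "B": [(750, 50, 5.0), (250, 25, 2.5)],
    "C": [(1000, 40, 4.0)],
}
for name, segments in schedules.items():
    t, e = schedule_energy(segments, K)
    print(f"({name}) {t:.0f} s, {e:.1f} J")
```

This simple scaling gives 25.6 J for (C), close to the 25 J label in the figure. Running at one steady, lower voltage that just meets the deadline beats both finishing early at full voltage and switching voltage partway through.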

OS: Voltage Scheduling

Scale Supply Voltage with fCLK

Adaptive Power Supply Voltages

Data Driven Signal Processing
The basic idea of averaging: two samples are buffered and their workloads are averaged. The averaged workload is then used as the effective workload to drive the power supply. Using a ping-pong buffering scheme, data samples In+2 and In+3 are buffered while In and In+1 are being processed.

Example of Buffering

RTL: Multiple Supply Voltages Scheduling Filter Example

SOC CAD Companies Avant! www.avanticorp.com Cadence www.cadence.com
Duet Tech Escalade Logic visions Mentor Graphics Palmchip Sonic Summit Design Synopsys Topdown design solutions Xynetix Design Systems Zuken-Redac
