6 ALU Blocks and Control Contents 1. Adder 2. Multiplier

Slides:

Advertisements

Similar presentations

1 Chapter 2 Basic Physics of Semiconductors  2.1 Semiconductor materials and their properties  2.2 PN-junction diodes  2.3 Reverse Breakdown.

Advertisements

Chapter 9. 컴퓨터설계기초 9-1 머리말 9-2 데이터 처리장치 (Datapath)

Chapter 4. Post Layout Simulation

Sources of the Magnetic Field

Chapter 7 ARP and RARP.

ASIC의 개요 ASIC(Application Specific Integrated Circuit) 특정 용도 주문형 집적회로

Chapter 3 데이터와 신호 (Data and Signals).

Chaper 2 ~ chaper 3 허승현 제어시스템 설계.

Project . A/D Converter AD Converter using for MOSFET 무한도전 팀명 : 무한도전

Digital Logic Structures

축산 인식개선을 위한 농협의 추진 사례 ( ) 농협중앙회 축산지원단장 박인희.

기본 컴퓨터 프로그래밍 Lecture #6.

디지털 시스템 2010년 1학기 교수: 송상훈 연구실: 율곡관 603-B

Computer System Architecture

7 조합논리회로 IT CookBook, 디지털 논리회로.

컴퓨터 과학 개론 √ 원리를 알면 IT가 맛있다 컴퓨터 과학도를 위한 첫 전공서 ehanbit.net.

2 Part 전자계산기 구조 1. 논리 회로 2. 자료 표현 및 연산 3. 명령어 및 프로세서 4. 명령 수행 및 제어 5.

4 컴퓨터에서 활용되는 디지털 논리회로 IT CookBook, 컴퓨터 구조와 원리 2.0.

32비트 캐리 예측 덧셈기(CLA) RCA(Ripple Carry Adder)

제어기술 소개 목표 : 제어기의 종류, 제어 방식 등을 살펴본다. 주요내용 제어기의 종류 제어방식 : 시퀀스, 피드백, 등.

디지털 산술과 연산회로.

Chapter 10 Differential Amplifiers

10장 주변장치 (PIO) Slide 1 (of 28).

REINFORCEMENT LEARNING

7장 : 캐시와 메모리.

3장 MPU 내부구조 Slide 1 (of 28).

임베디드 하드웨어 Lecture #6.

Fourier Series and Fourier Transform

Computer System Architecture

COMPUTER ARCHITECTIRE

DSP와 TMS320F28x의 이해.

EPS Based Motion Recognition algorithm Comparison

Ch2-2. VHDL Basic VHDL lexical element VHDL description

신장규*, 윤의식** *경북대학교 전자전기컴퓨터학부 ** KAIST 전자전산학과

6장. 물리적 데이터베이스 설계 물리적 데이터베이스 설계

Ch. 5 : Analog Transmission

하드웨어 구현 - A/D 변환기(A/D converter) - 샘플링 주파수(Sampling frequency)

Chapter 2. Finite Automata Exercises

Cluster Analysis (군집 분석)

Analog IC design 3주차 Oct.30th Multimedia Lab..

Medical Instrumentation

Chapter 4 The Von Neumann Model.

Chapter 1 Welcome Aboard.

파이프라이닝.

6장 연산 장치 6.1 개요 6.2 연산장치의 구성요소 6.3 처리기 6.4 기타 연산장치.

Computer System Architecture

Inferences concerning two populations and paired comparisons

Chapter 12 Memory Organization

Machine architecture Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Chapter 2.

Optimal placement of MR dampers

제5강 처리 장치 2.

알고리즘의 분석(analysis) 공간적 효율성(Space Efficiency)과 시간적 효율성(Time Efficiency)

이산수학(Discrete Mathematics)

IBM Corporation {haoxing, eleve, kravets,

(1) 필터 구조마다 유한 정세도 특성(finite precision characteristics)이 다름.

Layout XOR(LVS 후 출력 파형 검사) Rising delay: =0.837 nS

점화와 응용 (Recurrence and Its Applications)

DEGITAL LOGIC CIRCUIT Term Project – 4 bit ALU.

Machine architecture Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Chapter 2.

이산수학(Discrete Mathematics)

알고리즘의 분석(analysis) 공간적 효율성(Space Efficiency)과 시간적 효율성(Time Efficiency)

Machine architecture Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Chapter 2.

Introduction to Computer System 컴퓨터의 이해 3: 데이터 표현

임베디드 하드웨어 Lecture #6.

제10장. Other Models of TM’s 학습목표

[CPA340] Algorithms and Practice Youn-Hee Han

경사 식각을 이용한 폴리머 광 스위치 2층 배선 기술

알고리즘의 분석(analysis) 공간적 효율성(Space Efficiency)과 시간적 효율성(Time Efficiency)

Chapter 7: Deadlocks.

Presentation transcript:

6 ALU Blocks and Control Contents 1. Adder 2. Multiplier 3. Datapath Generation

1. Adder Full Adder Boolean equation Sum(Odd Parity) A×B×C CARRY A+B+C

Which is better? Boolean Equation 1 : Boolean Equation 2 : CARRY evaluation is more urgent since CARRY is in the critical path S0 S1 S2 Sn C1 C2 Cn Cn ADDER ADDER ADDER ADDER C0 A0 B0 A1 B1 A2 B2 An Bn [ Ripple Carry Adder ]

Alternating Complementary Form At Odd Stages At Even Stages A B C SUM CARRY SUM CARRY A B C SUM SUM CARRY CARRY

Alternating Complementary Form

Dynamic Serial Adder A A SUM S B B CARRY C R/S Q D CLOCK

Dynamic Configuration CARRY GATE SUM GATE OPTIONAL PRECHARGE DEVICE CK CK CK A B A C B A A SUM CK B OPTIONAL PRECHARGE DEVICE C B C S R CK CK C (CARRY) CK CK R Set/Reset Circuit S

Full Adder Truth Table Conjugate Symmetry Mutually Complement 1 2 3 7 CARRY SUM Mutually Complement 1 1 1 2 1 1 1 2 3 FC - on terms 3 1 1 1 4 1 1 7 6 5 4 FS - on terms 5 1 1 1 6 1 1 1 7 1 1 1 1 1 Conjugate Symmetry

Another Configuration of Carry & Sum Logic B CARRY STAGE 1 PROPAGATE 1 GENERATE CARRY SUM STAGE CARRY B C SUM A

Dynamic full adder using np CMOS logic style

Layout of the dynamic full adder

Looking at the FA Truth Table C CARRY SUM 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Transmission Gate Implementation C C A  B A B SUM A B A  B B A  B A  B CARRY C

CLA (Carry Lookahead Adder) P1 G1 P2 G2 P3 G3 P4 G4 C1 C2 C3 C4 An Bn Gn Pn

Carry bypass structure - basic concept

(N=16)-bit carry bypass adder(each stage: M bits) tp = tsetup + M * tcarry+(N/M - 1) tbypass + M*tcarry+tsum tsetup : time to create G and P signals tcarry : propagation delay through a single bit tbypass : propagation delay through MUX tsum : time to generate sum

Combining 4 Domino Carry Lookahead Blocks Manchester Carry Chain (4-bit) CK G1 P1 G2 P2 G3 P3 G4 P4 P1 P2 P3 P4 C0 C1 C2 C3 C4 MANCHESTER CARRY CHAIN C4 C0 C4 C0 G1 G2 G3 G4 CK C0 C1 C2 C3 C4 Limit @ 4 stages In the worst case, 6 Series Tr.s to the ground.

Improving Worst Case Carry Prop. Time MANCHESTER CARRY CHAIN C0 C4 C0 C4 CK P1 P2 P3 P4 CK

Manchester CC Adder Floorplan Dual CC Scheme One for Carry Prop. The other for off-loading the 1st CC from the SUM-block. GP SUM A4 MANCHESTER CARRY CHAIN MANCHESTER CARRY CHAIN BIT 4 S4 B4 GP SUM A3 MANCHESTER CARRY CHAIN MANCHESTER CARRY CHAIN BIT 3 S3 B3 GP SUM A2 MANCHESTER CARRY CHAIN MANCHESTER CARRY CHAIN BIT 2 S2 B2 SUM GENERATE SUM GENERATE A1 MANCHESTER CARRY CHAIN MANCHESTER CARRY CHAIN BIT 1 S1 B1 C0

CSA (Carry Select Adder) Realization of MUX with restoring logic A4 ~ A7 B4 ~ B7 Carry Selection C81 1 S41 ~ S71 1 S4 ~ S7 A4 ~ A7 B4 ~ B7 C80 C8 S40 ~ S70 Note) Realization of MUX with pass-transistor gates C4 C8 C8 1 C81 C80 C4 C80 C120 A0 ~ A3 B0 ~ B3 C8 C12 C4 C8 C4 C0 S0 ~ S3 C81 C121 C4 C8 S0 ~ S3 Vdd Vdd - Vt Vdd - 2Vt Threshold voltage loss per stage

CSA (Carry Select Adder) For carry propagation, use restoring logic in the alternative pattern A0 ~ A3 B0 ~ B3 C0 S0 ~ S3 C4 C80 C81 C120 C121 C8 Number of bits for each stage ex1) 32-bit case : 4, 4, 5, 6, 7, 6 ( or 4, 4, 5, 6, 6, 7) ex2) 64-bit case : 4, 4, 5, 6, 7, 8, 9, 10

Minimization of Carry Propagation Path Delay Carry Select Scheme (prepare result for each case, Cin=1, Cin=0) Simplify the carry selection using the characteristic between Ci0 & Ci1 Take complement carries alternating the Even and Odd stages Adjust each block size with the consideration to the delay of carry select logic carry propagation delay of each block = = carry propagation delay to the block adjust eg. for 32-bit path 4 4 5 6 6 7

16-bit Linear CSA(Carry Select Adder) M: #of bits/stage N : total # of bits tadd = tsetup + M * tcarry+ (N/M ) tmux + tsum

Square Root CSA tadd = tsetup + M * tcarry+ 2N tmux + tsum N = M + (M+1) + ….. + (M+P-1) = MP + P(P-1)/2 = P2/2 + P(M - 1/2 ) 9 stage

Propagation Delay of Linear and Square Root CSA and linear RCA

Carry Skip Adder Ripple Carry Adder와 CLA Adder의 Compromise a15 b15 a13 G12,15 G8,11 G4,7 c16 c12 c8 c4 c0 P12, 15 P8, 11 P4, 7

 pi’s and gi’s are computed from pi=aibi and gi = aibi  Initially, c4, c8 and c12 are cleared  After 4 clock cycle (at T0+4Tc), G-values are calculated as cout assuming ci=0(P-values are also calculated by then)  At this time (at T0+4Tc), true cout in the first stage, c4 is obtained.  After one, two and three clock cycles respectively, assuming the delay of each AOI gate as Tc true values of c8, c12 and c16 are obtained.  Sum and cout of the last block are obtained at (T0+4Tc+2Tc+4Tc)

Comparison of Carry Select & Carry Skip Adder A 32-bit Carry Select Adder Stage # 1 2 3 4 5 6 32 bit bits/stage 4 4 5 6 7 6 inc. delay 4 1 1 1 1 1 9k2(k2=delay due to 1-bit addition or MUX) A 32-bit Carry Skip Adder Stage # 1 2 3 4 5 6 bits/stage 4 5 6 7 8 2 inc. delay 4 1 1 1 1 2 10k2

Conditional Sum Adder MPX Triple 2-input MUX A2 B2 S21 C31 S20 C30 A1

Carry Lookahead Tree Adder Previous CLA implementation is not very adequate due to fan-in, fan-out problem & irregularity, despite the small(5) number of logic levels. Make it regular, using log2n - logic levels. a3 b3 a2 b2 g3 p3 g2 p2 G2,3 P2,3 G0,3 P0,3 a1 b1 a0 b0 g1 p1 g0 p0 G0,1 P0,1 ai bi gi pi Gj+1,k Pj+1,k Gi,k Pi,k Gi,j Pi,j [ 1st Part ]

Carry Lookahead Tree Adder Cj+1 Ci g2 g0 Gi,j p2 p0 Pi,j C2 C0 Ci G0,1 P0,1 [ 2nd Part ] C0 S3 a3 b3 S2 a2 b2 S1 a1 b1 S0 a0 b0 S3 ai bi C1 C3 C2 gi pi C0 Ci Gj+1,k Pj+1,k C0 Cj+1 Gi,j Pi,j C0 Ci Gi,k Ci Pi,k [ Complete CLA Tree Adder ]

Carry Save Adder Ripple Carry Adder Carry Lookahead Adder CSA (Conditional Sum Adder) CSA (Carry Select Adder) CSA (Carry Skip Adder) CSA (Carry Save Adder) Carry Propagate Adder

Carry Save Adder Carry Save Adder is used wherever a large number of operands have to be added. Previous Cycle Sum Operand Previous Cycle Carry ai bi ci F.A F.A F.A F.A F.A F.A F.A Carry F/F Sum F/F CSA stages F.A F.A F.A F.A F.A F.A F.A F.A F.A F.A F.A F.A CPA F.A F.A F.A F.A F.A F.A

2. Multiplier Add-and-Shift Algorithm multiplicand multiplier 1 + 1 1 + 1 multiplicand multiplier Multiplication procedure by Pencil-and-Paper Method Multiplication procedure by Add-and-Shift Algorithm

The Serial-Parallel Multiplier B D D D D D D D D b2 D b1 D F.A F.A F.A F.A F.A F.A F.A b0 D D D D D D D D Output

4x4 array multiplier

tmult = [(M-1) + (N-1)] * tcarry + (N-1) * tsum+ tand both tcarry and tsum are important Sum and Carry generation time need to be similar.

Carry-save Multiplier(CSM) Rectangular floorplan of CSM

The Modified Booth Algorithm (cont’) Booth Encoder Table Booth Encoder b2k+1 b2k b2k-1 multiplied by 1 1 1 + x + 2x - 2x - x b2k-1 A = b2k b2k-1 b2k 2A b2k+1 negative = b2k+1

Booth Multiplication Example Initial 0 Add -A 2-bit Shift Add 2A 01 11 -A 00 10 +2A 17 -9 Operation -153 +

The Modified Booth Algorithm Let’s consider a number B = (bn-1, bn-2, ... , b1, b0) written in 2’s-complement. B may be rewritten as follows : Example In this equation, the terms in brackets is in the set {-2, -1, 0, 1, 2} n-bit multiplier generates exactly n/2 partial products

Parallel Multiplier Multiplier has two basic operations The generation of partial products The summation of partial products Parallel multiplier avoids the overhead that is due to the separate controls of these two operations We speed up the multiplication The gain in speed is obtained at the expense of extra hardware Parallel multiplier can be implemented so as to support a high rate of pipelining

The Braun Multiplier A straightforward implementation One bit of the new partial product ( ai . bj ) One bit of the previous partial product Carry in In the first four rows there is no horizontal carry propagation (using carry-save adder) a0 b0 a0b0 P0 a1 b1 a1b0 a0b1 P1 a2 b2 a2b0 a1b1 a0b2 P2 a3 b3 a3b0 a2b1 a1b2 a0b3 P3 a3b1 a2b2 a1b3 P4 a3b2 a2b3 P5 a3b3 P6

The Braun Multiplier (cont’) F.A b0 b1 b2 b3 p0 p1 p2 p3 p4 p5 p6 p7 a0 a1 a2 a3

Baugh-Wooley Multiplier Modified in order to allow multiplication of signed number Let’s consider 2 number A and B (2’s complement number) The product A.B is

Baugh-Wooley Multiplier (cont’) F.A b0 b1 b2 b3 p0 p1 p2 p4 p5 p6 p7 p3 1

Wallace Tree Multipliers Full adder vs Wallace tree Useful whenever a large number of operands are to add. Completion time in Braun or Baugh-Wooley multiplier Using Ripple Carry Adder: Proportional to the twice number of n of bits Using Wallace trees, Proportional to log2 (n) Full Adder 20 21 Wallace n 20 2n 21

Recursive Decomposition of the Multiplication Partitioning two operands Four Terms (AH.BH, AH.BL, AL.BH, AL.BL) are computed using 4 p-bits multipliers The results are collected through Wallace tree

Recursive Decomposition of the Multiplication BH BL AH AL AL X BL AH X BL AH X BH AL X BH AL X BL AL X BH AH X BH AH X BL 4 X W3 Adder AH AL BH BL Aligning the four partial products

Booth’s Algorithm Array Multiplication Another approach to the design of a parallel multiplier for two’s complement operands The basic cell in rows i perform an add, subtract or transfer-only CASS (Controlled Add/Subtract/Shift) Cell cin Pin (partial product) a H D cout

Booth’s Algorithm Array Multiplication (cont’) a3 a2 a1 a0 H x3 CTRL D CASS CASS CASS CASS x2 H CTRL D CASS CASS CASS CASS CASS x1 H CTRL D CASS CASS CASS CASS CASS CASS x0 H CTRL D CASS CASS CASS CASS CASS CASS CASS P6 P5 P4 P3 P2 P1 P0 Xi Xi-1 1 Shift Subtract Add d D H

Generalized block diagram of an array multiplier

Q. Why use an array multiplier if it requires as many addition steps? A1) Array multiplier is combinational circuit, where the signals flow without being clocked. Multi-pass Array Multiplier : normally use a clock, but the cycle time for passing through k arrays is < kTc

A2) Some speed-up schemes are possible. e.g. E/O array, Wallace-tree Even-Odd Array

Wallace-tree Multiplier

6 x 6 Wallace-tree Multiplier Example (n : width of the Wallace tree) e.g. For 32-bit, number of adders necessary for each stage is 32 - 22 - 16 - 12 - 8 - 6 - 4 - 3 - 2 Total delay = 9 x adder delay

Datapath and its elements in bit-slice organization 3. Datapath Generation Datapath and its elements in bit-slice organization MEMORY CONTROL INPUT-OUTPUT DATAPATH

Two layout strategies for bit-slice datapath

Layout of 4-bit DP using layout strategy II (feedthrough)

1-D placement vs. 2-D placement

1-D placement vs. 2-D placement(Cont’)

Datapath Layout Flow circuit design layout back-annotation floorplan : block ordering, bus track assignment schematic drawing : tr. sizing layout cell drawing : leaf cell layout layout assemble : leaf cell integration (routing) DRC / LVS : design rule check, layout vs. schematic back-annotation simulation with the exact capacitance RTL description Floorplan Schematic Drawing Cell Drawing Layout Assemble DRC / LVS Back-Annotation Datapath Layout

Datapath Design Case (ACCENT HK386) real mode support of x86 instruction set enhanced (pipelined) datapath problems & practices of general DP layout 차질 요인 분석

Datapath structure 클럭 트리 유성형에게 문의 3 major blocks Segment,EA alu, register file(32bit) barallel shifter(40bit) segment/effective address(32bit) Barrel Shifter ALU Control buffer에 대한 개념 부족 단순히 buffer라고만 생각 compass에 bias된 결과 원래는 control buffer에서 buffering은 물론 decoding까지 해 주어야 함 control buffer영역이 datapath의 각 block(row)의 pitch 완충지대 역할 standard cell로 clock skew를 제대로 제어하기 힘듬 클럭 트리 유성형에게 문의 Register File

Track capacity 6 vertical wires/track in metal 1 VSS VDD TRACK(6) metal2 Control, Clock Power Datapath 자체를 over-the-cell routing하는 것이 훨씬 크기면에서 이로왔을 것이다. Metal3를 사용하지 않은 데에는, metal3가 특성이 나빠서 cap이크다는 이유도 있었음. N-well P-well 6 vertical wires/track in metal 1 metal3 reserved for P & G routing

Power Grid Segment,EA From bottom & left(chip edges) Considering IR drop BSH ALU 둘레를 power line으로 쌓지 않았음 처음부터 고려하지 않았기 때문에 지역마다 power line의 밀도 차가 많음 좌측 상단은 전압이 좀 떨어질 것임 얼마나? RF

Cell Structure Initial cell template decision N-well P-well Nwell in the left Pwell in the right data flow vertical control flow horizontal Similar cell structure as VTI Cell width 80  for PMOS 70  for NMOS 70 80 N-well P-well 옆으로 두개 정도의 tr이 그려질 수 있는 크기로 70, 80으로 했음 inverter기준으로 생각하면 pmos쪽이 2배정도 커야 하지만 nand나, pass tr도 많이 있을 것으로 생각하고 pmos쪽이 약간 만 크도록 했음 over-the-cell routing을 생각한다면 track의 개수를 기준으로 높이를 잡는 것이 올바른 선택임 기본적으로 poly가 control flow와 같은 방향으로 그려지는 것을 기준으로 생각했음(VTI style) Intel쪽은 이렇게 안그린 다고 함 25 10 35 45 10 25

Cell Structure 모든 쎌에 power line이 통과함 power line width 10  (2 contact) power line location  25  to the inside from the boundary 제대로 된 파워폭 얼마로 할지 준서형에게 문의 왜 미니멈의 두배로 하나? 파워 라인 양쪽 옆으로 TR을 둘 수 있게 25lambda를 띄워 놨는데 별료 효과적이지 못함(그림 예 mux쪽)

Accent Cell Layout Flow Block Spec. 처음에 cap을 가정하고 시뮬레이션 TR sizing은 간단하게 끝냄 cap이 정확하지 않으니까 optimize는 필요 없고 spec만 만족하면 된다고 생각함 전체 assemble이 되어야 정확한 cap이 나오므로 한참동안 일에서 손을 뗌 assemble된 다음 layout을 고치면 새로 다시 assemble해야 하는데 엄청난 노가다 Schematic SPICE spice simulation spice 파일을 직접 손보아야 하기 때문에 불편 불편 => 잘 하기 힘들다. Cap이 제대로 맞을 리 없음 따라서 optimize도 제대로 안됨 대체로 bus쪽에 물린 TR은 oversizing되고 internal track에 물린 TR들은 작게 sizing됨 즉, bus cap은 크게 주고, internal track은 무시됨 control signal이 충분히 빨리 와서 대기하고 있을 것으로 가정하는데 control buffer가 없기 때문에 실제로는 그렇지 않은 경우가 대부분

Cell Design(I) Control flow Data flow Using 45 degree line for cell design Control flow Data flow

Cell Design(II) needless effort to reduce cell size Data flow ugly poly; current crowding Data flow

Critical path used for transistor sizing in relevant datapath element Signal flow graph를 그려 가면서 확인해 보면 대채로 oversize되었음을 알 수있다. 버스쪽 신호가 얼마나 늦게 들어올지는 생각을 못함 여기까지는 대채로 잘 진행된 것 처럼 보이나 사실은 assemble을 고려 하지 않았기 때문에 쎌을 많이 수정해야 되는 사황에 부딪힌다.

Assemble Data flow Track assignment needs to be done before the cell layout (not after).

학점의 가치 대학 성적과 사회에서의 성공은 별로 correlation이 없는데, 이것은 사실 신기한 일이 아니다. 사회 성공의 요인과 대학성적 기준이 매우 다르니까.

창업의 갈림길 창업을 하려면 우선 두가지를 명확히 해야 한다. 첫째, 국내시장 만을 target으로 하든지, 둘 중의 하나만 하라. 세계시장에의 도전은 어렵지만, 성공하면 국내시장은 저절로 따라간다.