6 ALU Blocks and Control Contents 1. Adder 2. Multiplier 3. Datapath Generation
1. Adder Full Adder Boolean equation Sum(Odd Parity) A×B×C CARRY A+B+C
Which is better? Boolean Equation 1 : Boolean Equation 2 : CARRY evaluation is more urgent since CARRY is in the critical path S0 S1 S2 Sn C1 C2 Cn Cn ADDER ADDER ADDER ADDER C0 A0 B0 A1 B1 A2 B2 An Bn [ Ripple Carry Adder ]
Alternating Complementary Form At Odd Stages At Even Stages A B C SUM CARRY SUM CARRY A B C SUM SUM CARRY CARRY
Alternating Complementary Form
Dynamic Serial Adder A A SUM S B B CARRY C R/S Q D CLOCK
Dynamic Configuration CARRY GATE SUM GATE OPTIONAL PRECHARGE DEVICE CK CK CK A B A C B A A SUM CK B OPTIONAL PRECHARGE DEVICE C B C S R CK CK C (CARRY) CK CK R Set/Reset Circuit S
Full Adder Truth Table Conjugate Symmetry Mutually Complement 1 2 3 7 CARRY SUM Mutually Complement 1 1 1 2 1 1 1 2 3 FC - on terms 3 1 1 1 4 1 1 7 6 5 4 FS - on terms 5 1 1 1 6 1 1 1 7 1 1 1 1 1 Conjugate Symmetry
Another Configuration of Carry & Sum Logic B CARRY STAGE 1 PROPAGATE 1 GENERATE CARRY SUM STAGE CARRY B C SUM A
Dynamic full adder using np CMOS logic style
Layout of the dynamic full adder
Looking at the FA Truth Table C CARRY SUM 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Transmission Gate Implementation C C A B A B SUM A B A B B A B A B CARRY C
CLA (Carry Lookahead Adder) P1 G1 P2 G2 P3 G3 P4 G4 C1 C2 C3 C4 An Bn Gn Pn
Carry bypass structure - basic concept
(N=16)-bit carry bypass adder(each stage: M bits) tp = tsetup + M * tcarry+(N/M - 1) tbypass + M*tcarry+tsum tsetup : time to create G and P signals tcarry : propagation delay through a single bit tbypass : propagation delay through MUX tsum : time to generate sum
Combining 4 Domino Carry Lookahead Blocks Manchester Carry Chain (4-bit) CK G1 P1 G2 P2 G3 P3 G4 P4 P1 P2 P3 P4 C0 C1 C2 C3 C4 MANCHESTER CARRY CHAIN C4 C0 C4 C0 G1 G2 G3 G4 CK C0 C1 C2 C3 C4 Limit @ 4 stages In the worst case, 6 Series Tr.s to the ground.
Improving Worst Case Carry Prop. Time MANCHESTER CARRY CHAIN C0 C4 C0 C4 CK P1 P2 P3 P4 CK
Manchester CC Adder Floorplan Dual CC Scheme One for Carry Prop. The other for off-loading the 1st CC from the SUM-block. GP SUM A4 MANCHESTER CARRY CHAIN MANCHESTER CARRY CHAIN BIT 4 S4 B4 GP SUM A3 MANCHESTER CARRY CHAIN MANCHESTER CARRY CHAIN BIT 3 S3 B3 GP SUM A2 MANCHESTER CARRY CHAIN MANCHESTER CARRY CHAIN BIT 2 S2 B2 SUM GENERATE SUM GENERATE A1 MANCHESTER CARRY CHAIN MANCHESTER CARRY CHAIN BIT 1 S1 B1 C0
CSA (Carry Select Adder) Realization of MUX with restoring logic A4 ~ A7 B4 ~ B7 Carry Selection C81 1 S41 ~ S71 1 S4 ~ S7 A4 ~ A7 B4 ~ B7 C80 C8 S40 ~ S70 Note) Realization of MUX with pass-transistor gates C4 C8 C8 1 C81 C80 C4 C80 C120 A0 ~ A3 B0 ~ B3 C8 C12 C4 C8 C4 C0 S0 ~ S3 C81 C121 C4 C8 S0 ~ S3 Vdd Vdd - Vt Vdd - 2Vt Threshold voltage loss per stage
CSA (Carry Select Adder) For carry propagation, use restoring logic in the alternative pattern A0 ~ A3 B0 ~ B3 C0 S0 ~ S3 C4 C80 C81 C120 C121 C8 Number of bits for each stage ex1) 32-bit case : 4, 4, 5, 6, 7, 6 ( or 4, 4, 5, 6, 6, 7) ex2) 64-bit case : 4, 4, 5, 6, 7, 8, 9, 10
Minimization of Carry Propagation Path Delay Carry Select Scheme (prepare result for each case, Cin=1, Cin=0) Simplify the carry selection using the characteristic between Ci0 & Ci1 Take complement carries alternating the Even and Odd stages Adjust each block size with the consideration to the delay of carry select logic carry propagation delay of each block = = carry propagation delay to the block adjust eg. for 32-bit path 4 4 5 6 6 7
16-bit Linear CSA(Carry Select Adder) M: #of bits/stage N : total # of bits tadd = tsetup + M * tcarry+ (N/M ) tmux + tsum
Square Root CSA tadd = tsetup + M * tcarry+ 2N tmux + tsum N = M + (M+1) + ….. + (M+P-1) = MP + P(P-1)/2 = P2/2 + P(M - 1/2 ) 9 stage
Propagation Delay of Linear and Square Root CSA and linear RCA
Carry Skip Adder Ripple Carry Adder와 CLA Adder의 Compromise a15 b15 a13 G12,15 G8,11 G4,7 c16 c12 c8 c4 c0 P12, 15 P8, 11 P4, 7
pi’s and gi’s are computed from pi=aibi and gi = aibi Initially, c4, c8 and c12 are cleared After 4 clock cycle (at T0+4Tc), G-values are calculated as cout assuming ci=0(P-values are also calculated by then) At this time (at T0+4Tc), true cout in the first stage, c4 is obtained. After one, two and three clock cycles respectively, assuming the delay of each AOI gate as Tc true values of c8, c12 and c16 are obtained. Sum and cout of the last block are obtained at (T0+4Tc+2Tc+4Tc)
Comparison of Carry Select & Carry Skip Adder A 32-bit Carry Select Adder Stage # 1 2 3 4 5 6 32 bit bits/stage 4 4 5 6 7 6 inc. delay 4 1 1 1 1 1 9k2(k2=delay due to 1-bit addition or MUX) A 32-bit Carry Skip Adder Stage # 1 2 3 4 5 6 bits/stage 4 5 6 7 8 2 inc. delay 4 1 1 1 1 2 10k2
Conditional Sum Adder MPX Triple 2-input MUX A2 B2 S21 C31 S20 C30 A1
Carry Lookahead Tree Adder Previous CLA implementation is not very adequate due to fan-in, fan-out problem & irregularity, despite the small(5) number of logic levels. Make it regular, using log2n - logic levels. a3 b3 a2 b2 g3 p3 g2 p2 G2,3 P2,3 G0,3 P0,3 a1 b1 a0 b0 g1 p1 g0 p0 G0,1 P0,1 ai bi gi pi Gj+1,k Pj+1,k Gi,k Pi,k Gi,j Pi,j [ 1st Part ]
Carry Lookahead Tree Adder Cj+1 Ci g2 g0 Gi,j p2 p0 Pi,j C2 C0 Ci G0,1 P0,1 [ 2nd Part ] C0 S3 a3 b3 S2 a2 b2 S1 a1 b1 S0 a0 b0 S3 ai bi C1 C3 C2 gi pi C0 Ci Gj+1,k Pj+1,k C0 Cj+1 Gi,j Pi,j C0 Ci Gi,k Ci Pi,k [ Complete CLA Tree Adder ]
Carry Save Adder Ripple Carry Adder Carry Lookahead Adder CSA (Conditional Sum Adder) CSA (Carry Select Adder) CSA (Carry Skip Adder) CSA (Carry Save Adder) Carry Propagate Adder
Carry Save Adder Carry Save Adder is used wherever a large number of operands have to be added. Previous Cycle Sum Operand Previous Cycle Carry ai bi ci F.A F.A F.A F.A F.A F.A F.A Carry F/F Sum F/F CSA stages F.A F.A F.A F.A F.A F.A F.A F.A F.A F.A F.A F.A CPA F.A F.A F.A F.A F.A F.A
2. Multiplier Add-and-Shift Algorithm multiplicand multiplier 1 + 1 1 + 1 multiplicand multiplier Multiplication procedure by Pencil-and-Paper Method Multiplication procedure by Add-and-Shift Algorithm
The Serial-Parallel Multiplier B D D D D D D D D b2 D b1 D F.A F.A F.A F.A F.A F.A F.A b0 D D D D D D D D Output
4x4 array multiplier
tmult = [(M-1) + (N-1)] * tcarry + (N-1) * tsum+ tand both tcarry and tsum are important Sum and Carry generation time need to be similar.
Carry-save Multiplier(CSM) Rectangular floorplan of CSM
The Modified Booth Algorithm (cont’) Booth Encoder Table Booth Encoder b2k+1 b2k b2k-1 multiplied by 1 1 1 + x + 2x - 2x - x b2k-1 A = b2k b2k-1 b2k 2A b2k+1 negative = b2k+1
Booth Multiplication Example Initial 0 Add -A 2-bit Shift Add 2A 01 11 -A 00 10 +2A 17 -9 Operation -153 +
The Modified Booth Algorithm Let’s consider a number B = (bn-1, bn-2, ... , b1, b0) written in 2’s-complement. B may be rewritten as follows : Example In this equation, the terms in brackets is in the set {-2, -1, 0, 1, 2} n-bit multiplier generates exactly n/2 partial products
Parallel Multiplier Multiplier has two basic operations The generation of partial products The summation of partial products Parallel multiplier avoids the overhead that is due to the separate controls of these two operations We speed up the multiplication The gain in speed is obtained at the expense of extra hardware Parallel multiplier can be implemented so as to support a high rate of pipelining
The Braun Multiplier A straightforward implementation One bit of the new partial product ( ai . bj ) One bit of the previous partial product Carry in In the first four rows there is no horizontal carry propagation (using carry-save adder) a0 b0 a0b0 P0 a1 b1 a1b0 a0b1 P1 a2 b2 a2b0 a1b1 a0b2 P2 a3 b3 a3b0 a2b1 a1b2 a0b3 P3 a3b1 a2b2 a1b3 P4 a3b2 a2b3 P5 a3b3 P6
The Braun Multiplier (cont’) F.A b0 b1 b2 b3 p0 p1 p2 p3 p4 p5 p6 p7 a0 a1 a2 a3
Baugh-Wooley Multiplier Modified in order to allow multiplication of signed number Let’s consider 2 number A and B (2’s complement number) The product A.B is
Baugh-Wooley Multiplier (cont’) F.A b0 b1 b2 b3 p0 p1 p2 p4 p5 p6 p7 p3 1
Wallace Tree Multipliers Full adder vs Wallace tree Useful whenever a large number of operands are to add. Completion time in Braun or Baugh-Wooley multiplier Using Ripple Carry Adder: Proportional to the twice number of n of bits Using Wallace trees, Proportional to log2 (n) Full Adder 20 21 Wallace n 20 2n 21
Recursive Decomposition of the Multiplication Partitioning two operands Four Terms (AH.BH, AH.BL, AL.BH, AL.BL) are computed using 4 p-bits multipliers The results are collected through Wallace tree
Recursive Decomposition of the Multiplication BH BL AH AL AL X BL AH X BL AH X BH AL X BH AL X BL AL X BH AH X BH AH X BL 4 X W3 Adder AH AL BH BL Aligning the four partial products
Booth’s Algorithm Array Multiplication Another approach to the design of a parallel multiplier for two’s complement operands The basic cell in rows i perform an add, subtract or transfer-only CASS (Controlled Add/Subtract/Shift) Cell cin Pin (partial product) a H D cout
Booth’s Algorithm Array Multiplication (cont’) a3 a2 a1 a0 H x3 CTRL D CASS CASS CASS CASS x2 H CTRL D CASS CASS CASS CASS CASS x1 H CTRL D CASS CASS CASS CASS CASS CASS x0 H CTRL D CASS CASS CASS CASS CASS CASS CASS P6 P5 P4 P3 P2 P1 P0 Xi Xi-1 1 Shift Subtract Add d D H
Generalized block diagram of an array multiplier
Q. Why use an array multiplier if it requires as many addition steps? A1) Array multiplier is combinational circuit, where the signals flow without being clocked. Multi-pass Array Multiplier : normally use a clock, but the cycle time for passing through k arrays is < kTc
A2) Some speed-up schemes are possible. e.g. E/O array, Wallace-tree Even-Odd Array
Wallace-tree Multiplier
6 x 6 Wallace-tree Multiplier Example (n : width of the Wallace tree) e.g. For 32-bit, number of adders necessary for each stage is 32 - 22 - 16 - 12 - 8 - 6 - 4 - 3 - 2 Total delay = 9 x adder delay
Datapath and its elements in bit-slice organization 3. Datapath Generation Datapath and its elements in bit-slice organization MEMORY CONTROL INPUT-OUTPUT DATAPATH
Two layout strategies for bit-slice datapath
Layout of 4-bit DP using layout strategy II (feedthrough)
1-D placement vs. 2-D placement
1-D placement vs. 2-D placement(Cont’)
Datapath Layout Flow circuit design layout back-annotation floorplan : block ordering, bus track assignment schematic drawing : tr. sizing layout cell drawing : leaf cell layout layout assemble : leaf cell integration (routing) DRC / LVS : design rule check, layout vs. schematic back-annotation simulation with the exact capacitance RTL description Floorplan Schematic Drawing Cell Drawing Layout Assemble DRC / LVS Back-Annotation Datapath Layout
Datapath Design Case (ACCENT HK386) real mode support of x86 instruction set enhanced (pipelined) datapath problems & practices of general DP layout 차질 요인 분석
Datapath structure 클럭 트리 유성형에게 문의 3 major blocks Segment,EA alu, register file(32bit) barallel shifter(40bit) segment/effective address(32bit) Barrel Shifter ALU Control buffer에 대한 개념 부족 단순히 buffer라고만 생각 compass에 bias된 결과 원래는 control buffer에서 buffering은 물론 decoding까지 해 주어야 함 control buffer영역이 datapath의 각 block(row)의 pitch 완충지대 역할 standard cell로 clock skew를 제대로 제어하기 힘듬 클럭 트리 유성형에게 문의 Register File
Track capacity 6 vertical wires/track in metal 1 VSS VDD TRACK(6) metal2 Control, Clock Power Datapath 자체를 over-the-cell routing하는 것이 훨씬 크기면에서 이로왔을 것이다. Metal3를 사용하지 않은 데에는, metal3가 특성이 나빠서 cap이크다는 이유도 있었음. N-well P-well 6 vertical wires/track in metal 1 metal3 reserved for P & G routing
Power Grid Segment,EA From bottom & left(chip edges) Considering IR drop BSH ALU 둘레를 power line으로 쌓지 않았음 처음부터 고려하지 않았기 때문에 지역마다 power line의 밀도 차가 많음 좌측 상단은 전압이 좀 떨어질 것임 얼마나? RF
Cell Structure Initial cell template decision N-well P-well Nwell in the left Pwell in the right data flow vertical control flow horizontal Similar cell structure as VTI Cell width 80 for PMOS 70 for NMOS 70 80 N-well P-well 옆으로 두개 정도의 tr이 그려질 수 있는 크기로 70, 80으로 했음 inverter기준으로 생각하면 pmos쪽이 2배정도 커야 하지만 nand나, pass tr도 많이 있을 것으로 생각하고 pmos쪽이 약간 만 크도록 했음 over-the-cell routing을 생각한다면 track의 개수를 기준으로 높이를 잡는 것이 올바른 선택임 기본적으로 poly가 control flow와 같은 방향으로 그려지는 것을 기준으로 생각했음(VTI style) Intel쪽은 이렇게 안그린 다고 함 25 10 35 45 10 25
Cell Structure 모든 쎌에 power line이 통과함 power line width 10 (2 contact) power line location 25 to the inside from the boundary 제대로 된 파워폭 얼마로 할지 준서형에게 문의 왜 미니멈의 두배로 하나? 파워 라인 양쪽 옆으로 TR을 둘 수 있게 25lambda를 띄워 놨는데 별료 효과적이지 못함(그림 예 mux쪽)
Accent Cell Layout Flow Block Spec. 처음에 cap을 가정하고 시뮬레이션 TR sizing은 간단하게 끝냄 cap이 정확하지 않으니까 optimize는 필요 없고 spec만 만족하면 된다고 생각함 전체 assemble이 되어야 정확한 cap이 나오므로 한참동안 일에서 손을 뗌 assemble된 다음 layout을 고치면 새로 다시 assemble해야 하는데 엄청난 노가다 Schematic SPICE spice simulation spice 파일을 직접 손보아야 하기 때문에 불편 불편 => 잘 하기 힘들다. Cap이 제대로 맞을 리 없음 따라서 optimize도 제대로 안됨 대체로 bus쪽에 물린 TR은 oversizing되고 internal track에 물린 TR들은 작게 sizing됨 즉, bus cap은 크게 주고, internal track은 무시됨 control signal이 충분히 빨리 와서 대기하고 있을 것으로 가정하는데 control buffer가 없기 때문에 실제로는 그렇지 않은 경우가 대부분
Cell Design(I) Control flow Data flow Using 45 degree line for cell design Control flow Data flow
Cell Design(II) needless effort to reduce cell size Data flow ugly poly; current crowding Data flow
Critical path used for transistor sizing in relevant datapath element Signal flow graph를 그려 가면서 확인해 보면 대채로 oversize되었음을 알 수있다. 버스쪽 신호가 얼마나 늦게 들어올지는 생각을 못함 여기까지는 대채로 잘 진행된 것 처럼 보이나 사실은 assemble을 고려 하지 않았기 때문에 쎌을 많이 수정해야 되는 사황에 부딪힌다.
Assemble Data flow Track assignment needs to be done before the cell layout (not after).
학점의 가치 대학 성적과 사회에서의 성공은 별로 correlation이 없는데, 이것은 사실 신기한 일이 아니다. 사회 성공의 요인과 대학성적 기준이 매우 다르니까.
창업의 갈림길 창업을 하려면 우선 두가지를 명확히 해야 한다. 첫째, 국내시장 만을 target으로 하든지, 둘 중의 하나만 하라. 세계시장에의 도전은 어렵지만, 성공하면 국내시장은 저절로 따라간다.