CE302 Tutorial questions
Note: We might skip some questions based on progress through the semester. Listen in the lectures for details.
Section A: Background familiarity and foundations
I'm designing a PDA, and have a choice of the following processors. Assuming that I want to run at 20 MIPS, Which one would I choose and why?
|
Name |
Number of transistors |
Maximum clock speed |
Die size (mm2) |
Operating voltage |
Current @ max speed |
MIPS @ max speed |
process (um) |
|---|---|---|---|---|---|---|---|
|
LEG 6 |
700000 |
40MHz |
5.4 |
3.3v |
52mA |
40 |
0.3 |
|
Octium III |
1300000 |
80MHz |
10.6 |
2.7v |
120mA |
60 |
0.5 |
|
PIG 16C84 |
600000 |
20MHz |
4.9 |
5v |
25mA |
18 |
0.6 |
|
picoBALI |
1100000 |
70MHz |
8.2 |
5v |
90mA |
60 |
0.2 |
Start by listing parameters that allow the four processors to be compared like-for-like.
I wrote a 'C' program to store 4 bytes (0, 1, 2, 3) to consecutive memory locations, and ran this on a little-endian computer with 32-bit wide memory. When I then examined memory would I see something like A or B:
|
bit 31 |
|
|
bit 0 |
||
|---|---|---|---|---|---|
|
A: |
3 |
2 |
1 |
0 |
|
|
bit 31 |
|
|
bit 0 |
||
|---|---|---|---|---|---|
|
B: |
0 |
1 |
2 |
3 |
|
Complete the following table (for 8-bit binary numbers), indicate if conversion is impossible:
|
Decimal |
Unsigned |
Two's complement |
Sign-magnitude |
|---|---|---|---|
|
123 |
|
|
|
|
-15 |
|
|
|
|
193 |
|
|
|
|
-127 |
|
|
|
With a two's complement (2.30) format number, how do we represent the value 0.783203125? Can this be represented exactly with 32 bits, 16bits and 8bits?
One BCD digit consists of 4 bits. Starting with a 4-bit ripple-carry adder, modify this with extra single-bit adders and logic gates to create an adder that can add two BCD digits, and produce a BCD sum. Extend the design so that it can add two 4 digit BCD numbers.
Using partial products (long multiplication), manually multiply the two 4-bit binary numbers X=1011 and Y=1101 assuming they are a) unsigned numbers, b) two's complement numbers.
Repeat the multiplication of question 6a using Booth's algorithm.
If ADD, SHIFT and compare operations each require a single CPU cycle to complete, how many CPU cycles are needed to perform the calculation in question 6a? Compare this with Booth's question 7: is Booth's algorithm more efficient for a larger word width?
Consider a RISC CPU that has an instruction MUL that can multiply the contents of two registers and store the result into a third register. The registers are 32-bits wide, and the stored result is the top 32-bits of the 64-bit logical result (remember that 32-bits * 32-bits should give 64-bits). However the programmer wants to determine the full 64-bit result..... how can he get it? (hint: you will need to do more than 1 multiply, and also a few ANDs and adds to get the result). Verify your method, and determine how many instructions are needed.
If we multiply the two (2.6) format unsigned numbers X=11010000 and Y=01110000 then we should get a (4.12) format result. We can shift the result two digits left, giving (2.14) [i.e. effectively removing the top two bits] and then truncate it to (2.6) [by discarding the lower 8 bits]. Will this cause an overflow, and will the truncation lose any bits?
a) Consider the IEEE754 single-precision floating point standard. Express the value of the stored number N in terms of its storage bits (sign,E,S) for the following cases:
E=255, S is non 0
E=255, S=0
0 < E < 255
E=0, S is non 0
E=0, S=0
Can a standard exponent/mantissa floating point number format represent zero in more than one way? Can IEEE754 represent zero in more than one way? Explain any differences....
Only some numbers can be represented exactly with a finite number of bits. If A' is he stored value representing a real value A, then the relative error r=(A-A')/A. Represent the decimal number 0.81 as a floating point number using a 4-bit biased integer exponent with base-2 and with a 6-bit mantissa. What is the relative error?
Section B: The CPU
Draw a 2-bit parallel master-slave register constructed from flip-flops. Show the inputs and outputs of the flip-flops, and their clock signals. Draw a timing diagram for their clock inputs to show the following sequence of operations. Show the master clock running at twice the register cycle frequency:
(i) data being clocked into the
register
(ii) data being output from the
register for one cycle
(iii) no output or input for one cycle
(iv) data being output from the
register for one cycle
(v) new data being clocked in
Consider the block diagram of an ALU and 3 registers connected in a 3 bus CPU as shown below. Assume that this diagram is complete except for a memory interface to each bus, and that memory transfers are much slower than register data movements.

a) Design a 2-bit ALU with inputs A and B that contains two 1-bit full adders and circuitry to perform AND, OR, ADD and SUBTRACT operations. Assume the inputs are unsigned.
b) If each logic gate has a 10ns propagation delay, and each 1-bit full adder has a 30ns delay between any input and any output, what would be the maximum operating frequency of your ALU?
Show how four of those 2-bit ALUs can be combined to make an 8-bit ALU (unsigned). How would you modify the design to cope with signed numbers.
The following pseudo-code segment is executed on a RISC processor.
loop i=0,1read X from memory address 0 read Y from memory address i Z=X+Y write Z to memory address i+1
The processor takes 1 cycle to complete all internal operations (including cache accesses).
Saving data from cache to RAM takes 4 cycles.
Loading data from RAM to cache takes 4 cycles (+ 1 cycle to continue from cache to CPU)
Assuming that the system has a direct cache which is initially empty, how many cycles are required for this code if the cache uses the following policies;
(a) write back
(b) write through with no write allocate (WTNWA)
(c) write through with write allocate (WTWA)
If the assembler instrution LSL means "logical shift left", LSR means "logical shift right", ASL means "arithmetic shift left", and ASR means "arithmetic shift right" then what are the results of doing this to the following signed 16-bit numbers:
(i) 0x00CA ASR 1
An analysis of representative code for a RISC processor (with only 8 instructions) finds the following occurrences of those instruction:
|
Instruction |
Number of occurrences |
|---|---|
|
ADD |
30 |
|
AND |
22 |
|
LDR |
68 |
|
MOV |
100 |
|
NOT |
15 |
|
ORR |
10 |
|
STR |
60 |
|
SUB |
6 |
If each instruction (excluding operands) is 6-bits long, how many bits does the program occupy?
Use the information in the table to design a Huffman coding for the processor.
Calculate the number of bits needed to store the program using the Huffman coded instruction set.
Design a Huffman inverted-tree decoder to verify your answer to the question 20.
Show the sequence of stack PUSHes and POPs during the execution of the following reverse polish notation (RPN) operations, and translate each into infix notation:
(i) ab+On some pipelined processors, a conditional branch can cause a stall or wasted cycle.
a) The following code segment might stall a 3-stage pipeline. Why?
note: an 'S' after the instruction means it sets the condition codes - no S means that condition codes will not be set, and assume that every instruction completes in a single pipeline cycleMOV R0,R3 ;R0=R3 ORR R4,R3,R5 ;R4=R3 or R5 AND R7,R6,R5 ;R6=R4 and R5 ADDS R0,R1,R2 ;R0=R1+R2, set condition codes BGT loop ;Branch if the result is greater than 0
b) Reorder the code to reduce the likelihood of a stall occurring.
If a delayed branch was available for the ARM, the BGT could be replaced by a BGTD in the code above. Re-write the code to use the delayed branch (hint: you only need to move one instruction)
Imagine a 16-bit RISC CPU that requires two cycles per instruction (the first to fetch an instruction from RAM and the second to execute the instruction). Assume that each cycle requires 250ns so that each instruction completes in exactly 500ns. The CPU uses its main bus to write data to a soundcard: it is transferring 16-bit words to the soundcard at a rate of 44.1kHz. Since the audio data is using bus bandwidth, the CPU can not work at full speed.
a) What proportion of system bus capacity is free assuming that there is no DMA and that the audio data originates in RAM.
b) If the system uses DMA where 1 CPU instruction is issued to transfer each audio sample direct from RAM to the soundcard, what proportion of system bus capacity is now free?
c) Now assume the use of interrupts. An audio ISR contains 14 instructions, and is activated every 16 samples for us to write 16 samples to a FIFO buffer on the soundcard. The samples are transferred in a DMA block (with the DMA setup included in the ISR code). What proportion of system capacity is free now?
To implement a digital audio delay, a processor has to continuously read in audio samples, delay them, and output them some time later. For 16-bit audio at a sample rate of 8kHz, how many samples must the wait be for a 100ms delay? Implement this on the ADSP2181 using a circular buffer, and write the pseudo code using the instructions. Assume the buffer memory is empty at the beginning;
<reg>=IO(audioport) to read in data to register IO(audioport)=<reg> to read out data from register <reg>=DM(I0,M0) to get data from memory location pointed to by I0, and pointer I0 is incremented by the value in M0 after the operation DM(I0,M1)=<reg> stores value in register to memory location pointed to by I0, and after the operation, I0=I0+M1 I0=buffer_start sets I0 to point to a start of buffer in memory L0=buffer_end sets a circular buffer up (when I0 reaches L0, I0 is reset) M0=x sets address modifier M0 to contain a value of x M1=x sets address modifier M1 to contain a value of x B loop branches to the program label called loop
Section C: Design issues
A hardware stack is essential for a stack machine. Design a hardware stack for 4-bit data, using registers, counters, multiplexers, gates etc. as needed. It's depth is 8, so it should hold up to 8 data items.
What happens if you PUSH 9 values on to your stack? What happens if you POP 9 values off the stack? How can you modify this behaviour in your hardware design?
Consider a microcontroller that has no cache but a small amount of fast on-chip memory. You want to execute a real-time algorithm that reads two values from memory, and multiply-accumulates them. It loops this code 256 times. The off-chip memory operates on two wait-states. How many extra cycles are involved if off-chip memory is used instead of on-chip memory?
Generally, on-chip memory is also more efficient than off-chip memory. A typical value might be that a write to on-chip memory requires 2nJ of energy whilst a write to off-chip memory requires 10nJ. For a 3.3volt system, with a 20ns cycle time how much extra current would be required for continuous writes to off-chip memory in the previous question if the loop occurred every 10,000 cycles?
You are designing the next generation PC. The fast RISC CPU you chose has a 32-bit external data bus with a cycle time of 5ns, which you need to interface to some RAM. For maximum performance, the CPU has to be able to access the RAM with only 1 wait state. Look at the table of available RAM chips from SimLim Square and decide which one you would use, and why:
|
Vendor |
IC part number |
Size of data bus |
RAM speed |
Cost each (100,000s) |
|---|---|---|---|---|
|
Fuwell |
RAM08A |
8 |
5ns |
$0.28 |
|
Laser |
RAM08D |
8 |
12ns |
$0.18 |
|
Aquest |
RAM16B |
16 |
8ns |
$0.35 |
|
Aquest |
RAM16C |
16 |
10ns |
$0.33 |
|
Fuwell |
RAM32C |
32 |
10ns |
$0.67 |
|
WhiteBox |
RAM33E |
32 |
15ns |
$0.50 |
|
VideoPro |
RAM64C |
64 |
10ns |
$0.80 |
You are now designing a notebook computer. You need to reduce power consumption, so you decide to use a new low-power peripheral chip to control the IrDA and serial ports. The chip allows you to decide whether it should interrupt the processor, or be used in polling mode. Which would you choose and why?
In section 5.4 of the lecture notes (on interrupts), a worked example was given of the number of cycles that the ARM7 needed to respond to an IRQ request. Repeat the calculation for the fast interrupt, the FIQ. For a 250MHz ARM chip, how much quicker is the FIQ?