CPU Execution Time is determined by three terms:
CPU Execution Time = Instruction Count * CPI * Clock Cycle Time
CPI: Clock cycles Per Instruction, which is the average number of
clock cycles each instruction takes to execute.
Clock Cycle Time: the length of one clock cycle (the reciprocal of the clock rate).
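As a quick check of the formula, here is a small Python calculation with made-up numbers (the instruction count, CPI, and cycle time below are purely illustrative):

    # Hypothetical program: 10 million instructions, CPI of 1, 8 ns clock cycle.
    instruction_count = 10_000_000
    cpi = 1.0
    clock_cycle_time_s = 8e-9                # 8 ns expressed in seconds

    cpu_execution_time_s = instruction_count * cpi * clock_cycle_time_s
    print(cpu_execution_time_s)              # 0.08 seconds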
In order to calculate the clock cycle time needed by the different instructions,
we make the following assumptions about the unit delays:
Clock Cycle Time needed by each instruction class:

  R-format: 2 ns (instruction fetch) + 1 ns (register read) + 2 ns (ALU execution)
            + 1 ns (write back to register file)                           = 6 ns
  lw      : 2 ns (instruction fetch) + 1 ns (register read) + 2 ns (ALU execution)
            + 2 ns (read Data Memory) + 1 ns (write back to register file) = 8 ns
  sw      : 2 ns (instruction fetch) + 1 ns (register read) + 2 ns (ALU execution)
            + 2 ns (write Data Memory)                                     = 7 ns
  beq     : 2 ns (instruction fetch) + 1 ns (register read) + 2 ns (ALU execution)
                                                                            = 5 ns
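Each cycle time above is simply the sum of the unit delays the instruction uses. A minimal Python sketch of that bookkeeping (the unit names and delays are taken from the table above; the dictionaries and function are only illustrative):

    # Assumed delays (ns) for each datapath unit, from the table above.
    UNIT_DELAY_NS = {
        "instruction_fetch": 2,
        "register_read": 1,
        "alu": 2,
        "data_memory": 2,
        "register_write": 1,
    }

    # Units used by each instruction class.
    UNITS_USED = {
        "R-format": ["instruction_fetch", "register_read", "alu", "register_write"],
        "lw":       ["instruction_fetch", "register_read", "alu", "data_memory", "register_write"],
        "sw":       ["instruction_fetch", "register_read", "alu", "data_memory"],
        "beq":      ["instruction_fetch", "register_read", "alu"],
    }

    def cycle_time_ns(instr):
        """Cycle time an instruction class needs: sum of the delays of the units it uses."""
        return sum(UNIT_DELAY_NS[u] for u in UNITS_USED[instr])

    for instr in UNITS_USED:
        print(instr, cycle_time_ns(instr), "ns")   # R-format 6, lw 8, sw 7, beq 5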
There are two possible implementations: one with a fixed clock cycle length
(every instruction uses the same, longest cycle) and one with a variable clock
cycle length (each instruction class uses only the cycle time it needs).
To compare the performance of the two implementations, suppose
the following instruction distribution:
24% lw, 12% sw, 44% R-format, 20% beq
For the fixed clock length, the Clock Cycle Time must be at least as long as
the longest cycle time needed by any instruction, i.e., 8 ns for lw.
Clock Cycle Time (fixed) = 8ns
For the variable clock length, we can calculate the average value:
Clock Cycle Time (variable) = 8*24% + 7*12% + 6*44% + 5*20% = 6.4 ns
We use the same instruction sequence to test both implementations, so the
Instruction Count is the same, and CPI = 1 for both. We can get
  Performance (variable)     CPU execution time (fixed)
  ----------------------  =  -----------------------------
  Performance (fixed)        CPU execution time (variable)

                              Clock Cycle Time (fixed)
                           =  -----------------------------
                              Clock Cycle Time (variable)

                           =  8 / 6.4  =  1.25
This indicates that the variable clock implementation is 1.25 times faster
than the fixed clock implementation. Keep in mind that here we only
consider a simple instruction set. For a more complex instruction set
including floating-point instructions, the performance of a single-cycle
design with a fixed clock cycle length would be even worse.
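The whole comparison can be reproduced in a few lines of Python; the instruction mix and cycle times come from the example above, and the variable names are just for illustration:

    # Instruction mix and per-instruction cycle times (ns) from the example.
    mix      = {"lw": 0.24, "sw": 0.12, "R-format": 0.44, "beq": 0.20}
    cycle_ns = {"lw": 8,    "sw": 7,    "R-format": 6,    "beq": 5}

    # Fixed clock: every instruction gets the longest cycle time.
    fixed_ns = max(cycle_ns.values())                      # 8 ns

    # Variable clock: average cycle time weighted by the instruction mix.
    variable_ns = sum(mix[i] * cycle_ns[i] for i in mix)   # 6.4 ns

    print(fixed_ns / variable_ns)                          # speedup = 1.25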
Suppose washing a load of clothes has to go through the following
four steps, each taking 0.5 hours:

  washer ---> dryer ---> folder ---> storer
  0.5 hour    0.5 hour   0.5 hour   0.5 hour
Fig. 6.1 shows two approaches to doing laundry. One is the sequential
(non-pipelined) approach, which takes
2 * 4 = 8 hours
to wash 4 loads (each load takes 4 * 0.5 = 2 hours). In comparison, the pipelined approach takes only
2 + 3*0.5 = 3.5 hours
to wash four loads. So the pipelined approach is more than
2 times faster than the non-pipelined approach for the task of washing
4 loads.
In fact, if all the stages take about the same amount of time and
there is enough work to do, then the speedup due to pipelining
is equal to the number of stages in the pipeline. Supposing there
are 1000 wash loads, the non-pipelined approach takes
2*1000 = 2000 hours
while the pipelined approach only takes
2 + 999*0.5 = 501.5 hours.
The speedup
2000/501.5 is approximately equal to 4.
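The laundry arithmetic generalizes to any number of loads. A short Python sketch, assuming the four 0.5-hour stages from the example:

    STAGE_HOURS = 0.5   # washer, dryer, folder, and storer each take 0.5 hours
    NUM_STAGES  = 4

    def sequential_hours(loads):
        """Non-pipelined: each load finishes all four stages before the next starts."""
        return loads * NUM_STAGES * STAGE_HOURS

    def pipelined_hours(loads):
        """Pipelined: the first load takes the full 2 hours, then one load finishes every 0.5 hours."""
        return NUM_STAGES * STAGE_HOURS + (loads - 1) * STAGE_HOURS

    for n in (4, 1000):
        seq, pipe = sequential_hours(n), pipelined_hours(n)
        print(n, seq, pipe, round(seq / pipe, 2))   # 4: 8.0 vs 3.5; 1000: 2000.0 vs 501.5, speedup ~4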
The five-stage pipelined datapath is shown in Fig. 6.10 in the textbook:
1. IF : Instruction fetch
2. ID : Instruction decode and register file read
3. EX : ALU execution
4. MEM: Data memory access
5. WB : Write back.
Since at most five instructions can be in the datapath at
the same time in the five-stage datapath, we need to save the information
needed by each instruction. For example, if we did not save an
instruction's bits, the following instruction entering the datapath
would overwrite them, and all the information for
the previous instruction would be lost.
Just as the PC (program counter) passes the instruction
address from one clock cycle to the next clock cycle, we can
insert pipeline registers between every two stages, as shown in Fig.
6.12 in the textbook:
IF/ID registers: PC address, instruction
ID/EX registers: PC address, Read Data 1, Read Data 2, sign-extended offset
EX/MEM registers: branch address, Zero signal, ALU result, Read Data 2
MEM/WB registers: Read Data from Data Memory, ALU result
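One way to picture what each pipeline register has to carry is as a plain record per register; the field names follow the list above (this is only a descriptive Python sketch, not the hardware of Fig. 6.12):

    from dataclasses import dataclass

    @dataclass
    class IF_ID:
        pc: int                  # PC address, still needed to compute the branch target
        instruction: int         # the fetched instruction bits

    @dataclass
    class ID_EX:
        pc: int
        read_data_1: int         # first value read from the register file
        read_data_2: int         # second value read from the register file
        sign_ext_offset: int     # sign-extended offset field

    @dataclass
    class EX_MEM:
        branch_address: int
        zero: bool               # ALU Zero signal (used by beq)
        alu_result: int
        read_data_2: int         # value to be stored by sw

    @dataclass
    class MEM_WB:
        mem_read_data: int       # Read Data from Data Memory (for lw)
        alu_result: int          # ALU result (for R-format instructions)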