RISC (Reduced Instruction Set Computer)
Here's the basic point-form text, at least most of what I wrote on the
chalkboard.
Goals
- speed
- Given modern sophisticated compilers, and the fact that people basically
  do not write directly in assembly language anymore, we make the machine
  language geared towards the hardware, not the programmer. In fact we may
  even be willing to end up with an assembly language which would be really
  painful for people to try to write in; that doesn't matter, because we'll
  use high-level languages and compilers.
- The goal of the following design choices is SPEED.
- reduce the number of cycles per instruction.
  Goal: execute one instruction per cycle
Basic design ideas
- no microcode
- pipelining
- Any operation that cannot be completed in one cycle is not included in
the instruction set.
- three-register instruction format
- lots of registers
- overlapping register windows
Pipelining
Instructions X, Y, and Z which do not affect control flow:
addr | instr |
2 | X |
3 | Y |
4 | JUMP 8 |
5 | |
6 | |
7 | |
8 | Z |
Phases of instruction execution and some nicknames for them:
- fetch & decode instruction (FI)
- fetch operands (FO)
- store result (S)
Traditionally, execute each phase in turn for each instruction.
- FIX
- FOX
- SX
- FIY
- FOY
- SY
- FIJUMP
- FOJUMP
- SJUMP
- FIZ
- FOZ
- SZ
But consider that once we've finished fetching the X instruction, during the
decoding we're not using the memory path (MAR/MDR). We may or may not need
the memory path to fetch the operands; let's assume we don't need it here.
Then there's no reason we couldn't do "FOX" and "FIY" simultaneously.
- FIX
- FOX and FIY (simultaneously!)
- SX and FOY and FIJUMP
- SY and FOJUMP and FI5 (well, we don't know we're jumping yet)
- SJUMP (oops that changes the PC, start over)
- FIZ
- FOZ and FI9
- SZ and FO9 and FI10
- ...
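The overlap and the squash above can be simulated. Here is a minimal Python sketch of the idea (the three phases and the addresses come from the example above; nothing here models a real CPU), in which a JUMP is only recognized at its S phase, so anything fetched behind it gets squashed:

```python
# Toy three-phase pipeline: FI (fetch instruction), FO (fetch operands),
# S (store result). One instruction enters FI per cycle; a JUMP is only
# recognized at its S phase, at which point the in-flight work is squashed.

def run(program, start, max_cycles=50):
    """program: {addr: ('OP', jump_target_or_None)}.
    Returns the addresses of the instructions that completed, in order."""
    pc, pipe, done = start, [], []  # pipe entries: [addr, phase]; 0=FI 1=FO 2=S
    for _ in range(max_cycles):
        squashed = False
        for addr, phase in pipe:
            if phase == 2:                    # S phase: instruction completes
                done.append(addr)
                op, target = program[addr]
                if op == 'JUMP':
                    pc, pipe, squashed = target, [], True  # squash in-flight work
        if not squashed:
            pipe = [[a, p + 1] for a, p in pipe if p < 2]  # advance the pipeline
        if pc in program:
            pipe.append([pc, 0])              # FI of the next instruction
            pc += 1
        if not pipe and pc not in program:
            break
    return done

# The example above: X, Y, JUMP 8 at addresses 2..4, Z at address 8.
program = {2: ('X', None), 3: ('Y', None), 4: ('JUMP', 8),
           5: ('A', None), 6: ('B', None), 7: ('C', None), 8: ('Z', None)}
```

Running run(program, 2) completes only the instructions at addresses 2, 3, 4, and 8; the instructions at 5 and 6 are fetched but squashed, which is exactly the lost time discussed below.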
Pipelining raises lots of issues:
- branch invalidates work in progress
- time lost (threatens one instruction per cycle!)
- have to reverse any operand-fetch side-effects (e.g. auto-increment)! (is this even always possible?)
- multiple phases may need MAR/MDR
- "load" (move memory to register) may require extra cycle
- By the time we'd have fetched any other kind of operand, for a load we've
  only got the effective address. We now need to put this into the MAR
  and do a Read, etc.
- problem if an operand is the result of the previous instruction
Some typical features of RISC CPUs (not all ubiquitous)
- no complex addressing
- load/store instruction set
- separate instruction/data paths to memory
- delayed load
- operand forwarding
- delayed branch
Delayed load
If a "LOAD" requires an extra cycle, we say that we simply may not
access the load target in the next instruction!
We want:
ADD R1, R3
LOAD 500, R4
ADD R4, R3
we can't access R4 on that third line (it's the LOAD's target as well as an
operand of the ADD)
could insert a NOP
better yet, rewrite:
LOAD 500, R4
ADD R1, R3
ADD R4, R3
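A compiler (or assembler) can do this rewriting mechanically. Here is a Python sketch of the idea, assuming an invented (op, sources, destination) encoding for instructions; a real scheduler checks far more than this:

```python
# Toy delayed-load scheduler. If the instruction after a LOAD touches the
# load target, try to swap the LOAD with an independent preceding
# instruction; failing that, insert a NOP. Encoding is invented:
# ('ADD', ('R1', 'R3'), 'R3') means ADD R1, R3 (result into R3).

def schedule(instrs):
    out = list(instrs)
    i = 0
    while i < len(out):
        op, srcs, dst = out[i]
        if op == 'LOAD' and i + 1 < len(out):
            _, nsrcs, ndst = out[i + 1]
            if dst in nsrcs or dst == ndst:          # next instr touches target
                prev_independent = (
                    i > 0
                    and out[i - 1][0] != 'LOAD'      # don't create a new hazard
                    and dst not in out[i - 1][1]     # prev doesn't read target
                    and out[i - 1][2] != dst         # ...and doesn't write it
                    and out[i - 1][2] not in srcs    # LOAD doesn't depend on prev
                )
                if prev_independent:
                    out[i - 1], out[i] = out[i], out[i - 1]   # hoist the LOAD
                else:
                    out.insert(i + 1, ('NOP', (), None))      # fall back to NOP
        i += 1
    return out
```

On the sequence from the text it produces exactly the rewrite shown above (the LOAD first, then the two ADDs); when no independent instruction is available, it falls back to a NOP.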
Delayed branch
like delayed load but more bizarre
time | fetch | execute |
0 | X | |
1 | Y | X |
2 | JUMP | Y |
3 | [5] | JUMP |
4 | Z | |
5 | [9] | Z |
At time 4, no instruction is executed. At time 3, we do a useless fetch, and
then we squash it.
There are pipelined architectures in which we do fetch instruction Z at time
3...
But there's a better way, which is a more typical RISC approach. It involves
no extra circuitry, in fact even less than squashing requires. But it's quite
bizarre from the assembly-language programming point of view, and is another
good reason to write in high-level languages only.
This is the delayed branch rule: the instruction following a branch, which
will already have been fetched and decoded, is EXECUTED, even if the branch is
taken! So in this example, the instruction at address 5 WILL be executed.
As with the delayed load, we can just put a NOP there, but more usually we try
to be clever.
If that jump op is not conditional on the results of instruction Y, we can
write:
addr | instr |
2 | X |
3 | JUMP 8 |
4 | Y |
5 | |
6 | |
7 | |
8 | Z |
and the instructions executed will be X, Y, Z.
This special memory location following a branch is known as the "branch delay
slot", a place where you put one additional instruction to be executed.
If the jump IS conditional on the result of instruction Y, perhaps we can do
Y, then the conditional jump, then X.
Failing everything, we use a NOP.
But assuming we usually manage to fill the delayed branch slot with a useful
instruction, we're managing to avoid squashing the pipeline on every jump,
which saves a lot of time.
Note that we probably still have to squash upon interrupt.
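To make the rule concrete, here is a tiny Python sketch of an executor that obeys the delayed branch rule; the program encoding (a dict from address to op, with an invented HALT to stop) is purely for illustration:

```python
# Toy executor obeying the delayed branch rule: the instruction in the slot
# after a JUMP executes even though the jump is taken. Encoding is invented:
# {addr: ('OP', jump_target_or_None)}; 'HALT' stops execution.

def execute(program, start):
    trace, pc = [], start
    while program.get(pc, ('HALT', None))[0] != 'HALT':
        op, arg = program[pc]
        if op == 'JUMP':
            slot_op = program.get(pc + 1, ('NOP', None))[0]
            if slot_op != 'NOP':
                trace.append(slot_op)   # the delay-slot instruction executes!
            pc = arg
        else:
            trace.append(op)
            pc += 1
    return trace

# The rewritten example from the text: X, JUMP 8, Y (in the delay slot), Z.
program = {2: ('X', None), 3: ('JUMP', 8), 4: ('Y', None),
           8: ('Z', None), 9: ('HALT', None)}
```

execute(program, 2) yields the trace X, Y, Z, matching the text: Y sits after the jump yet still runs.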
Three-register instruction format
Typical                 | Typical RISC    |
ADD R1, R2              | ADD R1, R2, R2  |
MOV R1, R3; ADD R2, R3  | ADD R1, R2, R3  |
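The saving in the second row can be seen with a small Python sketch, assuming the convention used in this document that the last register named is the destination:

```python
# Three-register vs two-operand ADD. Register names are illustrative.

def add3(regs, a, b, dst):       # RISC: ADD a, b, dst
    regs[dst] = regs[a] + regs[b]

def add2(regs, a, b):            # typical two-operand: ADD a, b  (b gets a+b)
    regs[b] = regs[a] + regs[b]

regs = {'R1': 5, 'R2': 7, 'R3': 0}
add3(regs, 'R1', 'R2', 'R3')     # one instruction
assert regs['R3'] == 12

regs = {'R1': 5, 'R2': 7, 'R3': 0}
regs['R3'] = regs['R1']          # MOV R1, R3: an extra instruction is needed
add2(regs, 'R2', 'R3')           # ADD R2, R3
assert regs['R3'] == 12          # same result, but two instructions
```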
Lots of registers
Extra space on chip: no microprogram, and most circuitry is simpler.
Registers really speed things up.
A digression on the use of registers by compilers is required here:
Programs in
high-level languages do not refer to registers; compilers can decide to put
variables in registers.
Of course we also need registers to be accumulators, that is, to do arithmetic
in, and to store intermediate results. We also use them to pass parameters.
The following:
i = x * y;
j = x + 2;
can be compiled to something like:
LOAD X, R1
LOAD Y, R2
MULT R1, R2
STORE R2, I
ADDI 2, R1
STORE R1, J
A naive compilation of the high-level language code would result in two loads
of x; but as shown above, we can keep x stashed in R1 and avoid re-loading it.
We can also identify common
sub-expressions (i.e. larger than a single variable)
and use registers to keep their value for subsequent reuse.
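The saving is easy to count. Here is a Python sketch in which each statement is modeled only by the variables it reads (an encoding invented purely for illustration):

```python
# Count memory loads with and without keeping variables in registers.
# Each statement is modeled only by the tuple of variables it reads.

def count_loads(statements, keep_in_registers):
    loads, loaded = 0, set()
    for reads in statements:
        for var in reads:
            if keep_in_registers and var in loaded:
                continue          # value is already sitting in a register
            loads += 1
            loaded.add(var)
    return loads

# i = x * y reads (x, y); j = x + 2 reads only (x,)
stmts = [('x', 'y'), ('x',)]
assert count_loads(stmts, False) == 3   # naive: x loaded twice
assert count_loads(stmts, True) == 2    # x kept in R1, as in the code above
```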
All these uses of registers make a program run faster. The more registers you
have, the more speedups a compiler (or human programmer)
can produce by clever use of registers.
Register windows
A huge number of registers, only some visible at a time. See the example
language below. JSR slides the window one way; RTS slides it back.
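A Python sketch of the mechanism (all sizes invented; real designs pick their own window and overlap sizes): only a window-sized slice of a large register file is visible, and the slide on JSR makes the caller's top registers overlap the callee's bottom ones, so parameters pass without touching memory.

```python
# Toy overlapping register windows. Sizes (64 registers, window of 8,
# overlap of 2) are invented for illustration.

class WindowedRegisters:
    def __init__(self, total=64, window=8, overlap=2):
        self.file = [0] * total
        self.slide = window - overlap   # how far JSR/RTS move the window
        self.base = 0
    def __getitem__(self, r):           # visible register r -> physical cell
        return self.file[self.base + r]
    def __setitem__(self, r, value):
        self.file[self.base + r] = value
    def jsr(self):
        self.base += self.slide         # slide the window forward on call
    def rts(self):
        self.base -= self.slide         # and back on return

regs = WindowedRegisters()
regs[6] = 42          # caller puts a parameter in one of its top two registers
regs.jsr()
assert regs[0] == 42  # callee sees the same cell as its register 0
regs[1] = 99          # callee's result goes in the overlap too
regs.rts()
assert regs[7] == 99  # caller reads it back; no memory traffic at all
```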
Summary
The points we discussed are:
- increase speed (design strategy #1 is the idea that speed is the most
important thing, not ease of programming)
- simple instructions
- fixed-format instructions
- no microcode
- pipelining
- no complex addressing
- load/store architecture
- three-register instruction format
- register windows
- other (a great many consequences of the above, e.g. pipelining leads to
delayed branch)
Not all of these ideas are necessarily found in any one RISC CPU, but a CPU
called a RISC will probably have most if not all of the above ideas.
In the other direction, just about any modern CPU does some pipelining,
even if it can't do some of these other things because it has a
backward-compatible instruction set.
Some of these ideas are arguably just as applicable to non-RISC CPUs, e.g.
register windows.
But register windowing is more likely to be found in a RISC because it's more
likely to have more registers, because more chip space is available due to the
lack of microcode and other complexities.
Example language
Here is a description of a hypothetical RISC machine language with some
example code.
I designed it to be severe along these lines. In practice, modern RISC
machine languages are normally not quite so severe, but exhibit some of
these strange characteristics.
References to "assignment three" in that document are actually a previous
year's
assignment three, but the examples are just as good even if they weren't
assignment questions, I think. I can translate some of this year's assignment
three solutions to this RISC machine language if people ask.