An answer:
It avoids "pipeline squash".
When we branch away, we have already fetched and decoded the word after the
branch instruction.
If we branch away and don't execute that word, we have to discard the work in
progress and start over with the destination word.
This loses a cycle, because no instruction is ready to execute at the point
where the pipeline is ready for one.
If instead we always execute the instruction in the delayed branch slot, then
we can place the branch instruction one instruction earlier in the program.
This avoids the pipeline squash, and we still execute one instruction per cycle.
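
To make the cycle counting concrete, here is a minimal sketch of a toy model, not a description of any real processor. It assumes a squash costs exactly one cycle, that every branch is unconditional, and that the delay slot never holds another branch. The names run, plain, and delayed are hypothetical and exist only for this illustration.

```python
def run(program, delay_slot):
    """Count cycles for a toy program.

    program is a list of (name, target) pairs; target is None for an
    ordinary instruction, or the index to jump to for a branch.
    Assumptions: every branch is taken, a squash costs one cycle, and
    the word after a branch is never itself a branch.
    """
    pc = 0
    cycles = 0
    trace = []
    while pc < len(program):
        name, target = program[pc]
        cycles += 1                 # one cycle to execute this instruction
        trace.append(name)
        if target is None:
            pc += 1
        elif delay_slot:
            # The word after the branch is already fetched and decoded,
            # so execute it instead of throwing it away.
            slot_name, _ = program[pc + 1]
            cycles += 1
            trace.append(slot_name)
            pc = target
        else:
            # Pipeline squash: discard the fetched word, lose a cycle.
            cycles += 1
            pc = target
    return cycles, trace


# Without a delay slot: "add" comes before the jump, and the word after
# the jump is fetched but discarded.
plain = [("add", None), ("jump", 3), ("unused", None), ("sub", None)]

# With a delay slot: the jump is placed one instruction earlier, and "add"
# sits in the slot after it.
delayed = [("jump", 3), ("add", None), ("unused", None), ("sub", None)]

print(run(plain, delay_slot=False))   # (4, ['add', 'jump', 'sub'])
print(run(delayed, delay_slot=True))  # (3, ['jump', 'add', 'sub'])
```

Both versions execute the same three instructions, but the version without a delay slot takes four cycles, because one cycle is lost to the squash. The delay-slot version takes three, one instruction per cycle.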