John Mashey on RISC

Most of you have seen most of this several times before; there is a little editing, nothing substantial. Some followup comments have been added.

PART I - ARCHITECTURE, IMPLEMENTATION, DIFFERENCES

WARNING: you may want to print this one to read it...

Anyway, it is not a fair comparison. Not by a long stretch. Let's see how the Nth generation SPARC, MIPS, and 88K's do (assuming they last) compared to some new design from scratch.

Well, there is baggage and there is BAGGAGE. One must be careful to distinguish between ARCHITECTURE and IMPLEMENTATION:

Architectures persist longer than implementations, especially user-level Instruction-Set Architecture.
The first member of an architecture family is usually designed with the current implementation constraints in mind, and if you're lucky, software people had some input.
If you're really lucky, you anticipate 5-10 years of technology trends, and that modifies your idea of the ISA you commit to.
It's pretty hard to delete anything from an ISA, except where:
1. You can find that NO ONE uses a feature (the 68020->68030 deletions mentioned by someone else).
2. You believe that you can trap and emulate the feature "fast enough".
  i.e., microVAX support for decimal ops, 68040 support for transcendentals.

Now, one might claim that the i486 and 68040 are RISC implementations of CISC architectures... and I think there is some truth to this, but I also think that it can confuse things badly:

Anyone who has studied the history of computer design knows that high-performance designs have used many of the same techniques for years, for all of the natural reasons, that is:

They use as much pipelining as they can; in some cases, if this means a high gate-count, then so be it.
They use caches (separate I & D if convenient).
They use hardware, not micro-code for the simpler operations.

(For instance, look at the evolution of the S/360 products. Recall that the 360/85 used caches, back around 1969, and within a few years, so did any mainframe or supermini.)

So, what difference is there among machines if similar implementation ideas are used?

A: there is a very specific set of characteristics shared by most machines labeled RISCs, most of which are not shared by most CISCs. The RISC characteristics:

Are aimed at more performance from current compiler technology (i.e., enough registers).
OR Are aimed at fast pipelining
- in a virtual-memory environment
- with the ability to still survive exceptions
- without inextricably increasing the number of gate delays (notice that I say gate delays, NOT just how many gates).

Even though various RISCs have made various decisions, most of them have been very careful to omit those things that CPU designers have found difficult and/or expensive to implement, and especially, things that are painful, for relatively little gain.

I would claim, that even as RISCs evolve, they may have certain baggage that they'd wish weren't there... but not very much. In particular, there are a bunch of objective characteristics shared by RISC ARCHITECTURES that clearly distinguish them from CISC architectures.

I'll give a few examples, followed by the detailed analysis:

MOST RISCs:

3a) Have 1 size of instruction in an instruction stream
3b) And that size is 4 bytes
3c) Have a handful (1-4) of addressing modes (* it is VERY hard to count these things; will discuss later).
3d) Have NO indirect addressing in any form (i.e., where you need one memory access to get the address of another operand in memory)
4a) Have NO operations that combine load/store with arithmetic, i.e., like add from memory, or add to memory. (note: this means especially avoiding operations that use the value of a load as input to an ALU operation, especially when that operation can cause an exception. Loads/stores with address modification can often be OK as they don't have some of the bad effects)
4b) Have no more than 1 memory-addressed operand per instruction
5a) Do NOT support arbitrary alignment of data for loads/stores
5b) Use an MMU for a data address no more than once per instruction
6a) Have >=5 bits per integer register specifier
6b) Have >= 4 bits per FP register specifier

These rules provide a rather distinct dividing line among architectures, and I think there are rather strong technical reasons for this, such that there is one more interesting attribute: almost every architecture whose first instance appeared on the market from 1986 onward obeys the rules above .....

Note that I didn't say anything about counting the number of instructions....

So, here's a table:

Age: number of years since first implementation sold in this family (or first thing with which this is binary compatible).
Note: this table was first done in 1991, so year = 1991-(age in table).
3a: # instruction sizes
3b: maximum instruction size in bytes
3c: number of distinct addressing modes for accessing data (not jumps). I didn't count register or literal, but only ones that referenced memory, and I counted different formats with different offset sizes separately. This was hard work... Also, even when a machine had different modes for register-relative and PC-relative addressing, I counted them only once.
3d: indirect addressing: 0: no, 1: yes
4a: load/store combined with arithmetic: 0: no, 1:yes
4b: maximum number of memory operands
5a: unaligned addressing of memory references allowed in load/store, without specific instructions
- 0: no never (MIPS, SPARC, etc)
- 1: sometimes (as in RS/6000)
- 2: just about any time
5b: maximum number of MMU uses for data operands in an instruction
6a: number of bits for integer register specifier
6b: number of bits for 64-bit or more FP register specifier, distinct from integer registers

Note that all of these are ARCHITECTURE issues, and it is usually quite difficult to either delete a feature (3a-5b) or increase the number of real registers (6a-6b) given an initial instruction set design. (yes, register renaming can help, but...)

Now:
- items 3a, 3b, and 3c are an indication of the decode complexity
- items 3d-5b hint at the ease or difficulty of pipelining, especially in the presence of virtual-memory requirements, and the need to go fast while still taking exceptions sanely
- items 6a and 6b are more related to the ability to take good advantage of current compilers.

There are some other attributes that can be useful, but I couldn't imagine how to create metrics for them without being very subjective; for example "degree of sequential decode", "number of writebacks that you might want to do in the middle of an instruction, but can't, because you have to wait to make sure you see all of the instruction before committing any state, because the last part might cause a page fault," or "irregularity/asymmetry of register use", or "irregularity/complexity of instruction formats". I'd love to use those, but just don't know how to measure them. Also, I'd be happy to hear corrections for some of these.

So, here's a table of 12 implementations of various architectures, one per architecture, with the attributes above. Just for fun, I'm going to leave the architectures coded at first, although I'll identify them later. I'm going to draw a line between H1 and L4 (obviously, the RISC-CISC Line), and also, at the head of each column, I'm going to put a rule, which, in that column, most of the RISCs obey. Any RISC that does not obey it is marked with a +; any CISC that DOES obey it is marked with a *. So...

	1991
CPU	Age	3a 3b 3c 3d	4a 4b 5a 5b	6a 6b	# ODD
RULE	<6	=1 =4 <5 =0	=0 =1 <2 =1	>4 >3
-------------------------------------------------------------------------
A1	4	 1  4  1  0	 0  1  0  1	 8  3+	1
B1	5	 1  4  1  0	 0  1  0  1	 5  4	-
C1	2	 1  4  2  0	 0  1  0  1	 5  4	-
D1	2	 1  4  3  0	 0  1  0  1	 5  0+	1
E1	5	 1  4 10+ 0	 0  1  0  1	 5  4	1
F1	5	 2+ 4  1  0	 0  1  0  1	 4+ 3+	3
G1	1	 1  4  4  0	 0  1  1  1	 5  5	-
H1	2	 1  4  4  0	 0  1  0  1	 5  4	-	RISC
---------------------------------------------------------------
L4	26	 4  8  2* 0*	 1  2  2  4	 4  2	2	CISC
M2	12	12 12 15  0*	 1  2  2  4	 3  3	1
N1	10	21 21 23  1	 1  2  2  4	 3  3	-
O3	11	11 22 44  1	 1  2  2  8	 4  3	-
P3	13	56 56 22  1	 1  6  2 24	 4  0	-

An interesting exercise is to analyze the ODD cases. First, observe that of 12 architectures, in only 2 cases does an architecture have an attribute that puts it on the wrong side of the line. Of the RISCs:

A1 is slightly unusual in having more integer registers, and less FP than usual. [Actually, slightly out of date, 29050 is different, using integer register bank instead, I hear.]
D1 is unusual in sharing integer and FP registers (that's what the D1:6b == 0).
E1 seems odd in having a large number of address modes. I think most of this is an artifact of the way that I counted, as this architecture really only has a fundamentally small number of ways to create addresses, but has several different-sized offsets and combinations, all within one 4-byte instruction. I believe that its addressing mechanisms are fundamentally MUCH simpler than, for example, M2, or especially N1, O3, or P3, but the specific number doesn't capture it very well.
F1 ... is not sold any more.
H1: one might argue that this processor has 2 sizes of instructions, but I'd observe that at any point in the instruction stream, the instructions are either 4-bytes long, or 8-bytes long, with the setting done by a mode bit, i.e., not dynamically encoded in every instruction.

Of the processors called CISCs:

L4 happens to be one in which you can tell the length of the instruction from the first few bits, has a fairly regular instruction decode, has relatively few addressing modes, no indirect addressing. In fact, a big subset of its instructions are actually fairly RISC-like, although another subset is very CISCy.
M2 has a myriad of instruction formats, but fortunately avoided indirect addressing, and actually, MOST of its instructions only have 1 address, except for a small set of string operations with 2. I.e., in this case, the decode complexity may be high, but most instructions cannot turn into multiple-memory-address-with-side-effects things.
N1, O3, and P3 are actually fairly clean, orthogonal architectures, in which most operations can consistently have operands in either memory or registers, and there are relatively few weirdnesses of special-cased uses of registers. Unfortunately, they also have indirect addressing, instruction formats whose very orthogonality almost guarantees sequential decoding, where it's hard to even know how long an instruction is until you parse each piece, and that may have side-effects where you'd like to do a register write-back early, but either:
- must wait until you see all of the instruction until you commit state
- or, must have "undo" shadow-registers
- or, must use instruction-continuation with fairly tricky exception handling to restore the state of the machine
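
To make the sequential-decode point concrete, here is a minimal Python sketch. The opcode and mode encodings are entirely hypothetical (they are not the real VAX, 68K, or NS32K encodings); the only thing it tries to show is that with per-operand specifiers you learn the instruction length only after walking every specifier, whereas with one fixed size the boundaries are known without looking at the bytes at all.

```python
# Hypothetical toy encoding, for illustration only.
OPERAND_COUNT = {0x01: 2, 0x02: 3}       # opcode -> number of operand specifiers
EXTRA_BYTES = {0x0: 0, 0x1: 1, 0x2: 4}   # mode -> bytes following the mode byte
                                         # (register, byte disp, long disp)

def variable_length(code, pc):
    """Length of the instruction at pc; requires parsing each specifier in turn."""
    length = 1                            # the opcode byte
    for _ in range(OPERAND_COUNT[code[pc]]):
        mode = code[pc + length]          # position depends on all earlier specifiers
        length += 1 + EXTRA_BYTES[mode]
    return length

def variable_boundaries(code):
    """Instruction start addresses for the variable-length toy encoding."""
    starts, pc = [], 0
    while pc < len(code):
        starts.append(pc)
        pc += variable_length(code, pc)   # cannot find instruction N+1 before decoding N
    return starts

def fixed_boundaries(code, size=4):
    """Instruction start addresses when every instruction is `size` bytes."""
    return list(range(0, len(code), size))

code = bytes([0x02, 0x0, 0x1, 0x10, 0x2, 0, 0, 0, 0x40,   # 3 operands, 9 bytes
              0x01, 0x0, 0x0])                            # 2 operands, 3 bytes
print(variable_boundaries(code))  # [0, 9]
print(fixed_boundaries(code))     # [0, 4, 8]
```

The fixed-size case lets a decoder look at several instructions in parallel; the variable case serializes on the previous instruction's length unless you add predecode hardware.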

It is also interesting to note that the original member of the family to which O3 belongs was rather simpler in some of the critical areas, with only 5 instruction sizes, of maximum size 10 bytes, and no indirect addressing, and requiring alignment (i.e., it was a much more RISC-like design, and it would be a fascinating speculation to know if that extra complexity was useful in practice). Now, here's the table again, with the labels:

	1991
CPU	Age	3a 3b 3c 3d	4a 4b 5a 5b	6a 6b	# ODD
RULE	<6	=1 =4 <5 =0	=0 =1 <2 =1	>4 >3
-------------------------------------------------------------------------
A1	4	 1  4  1  0	 0  1  0  1	 8  3+	1	AMD 29K
B1	5	 1  4  1  0	 0  1  0  1	 5  4	-	R2000
C1	2	 1  4  2  0	 0  1  0  1	 5  4	-	SPARC
D1	2	 1  4  3  0	 0  1  0  1	 5  0+	1	MC88000
E1	5	 1  4 10+ 0	 0  1  0  1	 5  4	1	HP PA
F1	5	 2+ 4  1  0	 0  1  0  1	 4+ 3+	3	IBM RT/PC
G1	1	 1  4  4  0	 0  1  1  1	 5  5	-	IBM RS/6000
H1	2	 1  4  4  0	 0  1  0  1	 5  4	-	Intel i860
---------------------------------------------------------------
L4	26	 4  8  2* 0*	 1  2  2  4	 4  2	2	IBM 3090
M2	12	12 12 15  0*	 1  2  2  4	 3  3	1	Intel i486
N1	10	21 21 23  1	 1  2  2  4	 3  3	-	NSC 32016
O3	11	11 22 44  1	 1  2  2  8	 4  3	-	MC 68040
P3	13	56 56 22  1	 1  6  2 24	 4  0	-	VAX
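
The "# ODD" column can be recomputed mechanically from the RULE row. The following Python sketch copies the ten attribute values from the labeled table and applies the rules as simple predicates; the data layout and function names are just one way to write it down. For the RISC rows it counts rules broken (the + marks), and for the CISC rows it counts rules obeyed (the * marks).

```python
# One predicate per column, in table order: 3a 3b 3c 3d 4a 4b 5a 5b 6a 6b.
RULES = [
    lambda v: v == 1,  # 3a: one instruction size
    lambda v: v == 4,  # 3b: maximum size is 4 bytes
    lambda v: v < 5,   # 3c: fewer than 5 addressing modes
    lambda v: v == 0,  # 3d: no indirect addressing
    lambda v: v == 0,  # 4a: no load/store combined with arithmetic
    lambda v: v == 1,  # 4b: at most 1 memory operand
    lambda v: v < 2,   # 5a: no arbitrary alignment
    lambda v: v == 1,  # 5b: 1 MMU use for data per instruction
    lambda v: v > 4,   # 6a: at least 5 bits of integer register specifier
    lambda v: v > 3,   # 6b: at least 4 bits of FP register specifier
]

RISC = {
    "AMD 29K":     (1, 4, 1, 0, 0, 1, 0, 1, 8, 3),
    "R2000":       (1, 4, 1, 0, 0, 1, 0, 1, 5, 4),
    "SPARC":       (1, 4, 2, 0, 0, 1, 0, 1, 5, 4),
    "MC88000":     (1, 4, 3, 0, 0, 1, 0, 1, 5, 0),
    "HP PA":       (1, 4, 10, 0, 0, 1, 0, 1, 5, 4),
    "IBM RT/PC":   (2, 4, 1, 0, 0, 1, 0, 1, 4, 3),
    "IBM RS/6000": (1, 4, 4, 0, 0, 1, 1, 1, 5, 5),
    "Intel i860":  (1, 4, 4, 0, 0, 1, 0, 1, 5, 4),
}
CISC = {
    "IBM 3090":   (4, 8, 2, 0, 1, 2, 2, 4, 4, 2),
    "Intel i486": (12, 12, 15, 0, 1, 2, 2, 4, 3, 3),
    "NSC 32016":  (21, 21, 23, 1, 1, 2, 2, 4, 3, 3),
    "MC 68040":   (11, 22, 44, 1, 1, 2, 2, 8, 4, 3),
    "VAX":        (56, 56, 22, 1, 1, 6, 2, 24, 4, 0),
}

def rules_obeyed(row):
    """Number of the ten RISC rules that this row satisfies."""
    return sum(1 for rule, value in zip(RULES, row) if rule(value))

# RISC side: how many rules are broken; CISC side: how many are obeyed.
risc_odd = {name: len(RULES) - rules_obeyed(row) for name, row in RISC.items()}
cisc_odd = {name: rules_obeyed(row) for name, row in CISC.items()}

for name, odd in {**risc_odd, **cisc_odd}.items():
    print(f"{name:12s} {odd}")
```

The Age column is deliberately left out, since the ODD counts in the table only cover the ten ISA attributes.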

General comment: this may sound weird, but in the long term, it might be easier to deal with a really complicated bunch of instruction formats, than with a complex set of addressing modes, because at least the former is more amenable to pre-decoding into a cache of decoded instructions that can be pipelined reasonably, whereas the pipeline on the latter can get very tricky (examples to follow). This can lead to the funny effect that a relatively "clean", orthogonal architecture may actually be harder to make run fast than one that is less clean. Obviously, every weirdness has its penalties... But consider the fundamental difficulty of pipelining something like (on a VAX):

    ADDL @(R1)+,@(R1)+,@(R2)+

I.e., something that, might theoretically arise from:

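	register int **r1, **r2;
	/* illustration only: r1 is modified twice without a sequence point,
	   which standard C leaves undefined */
	**r2++ = **r1++ + **r1++;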

Now, consider what the VAX has to do:

1) Decode the opcode (ADD)

2) Fetch first operand specifier from I-stream and work on it.

	a) Compute the memory address from (r1)
		If aligned
			run through MMU
				if MMU miss, fixup
			access cache
				if cache miss, do write-back/refill
		Elseif unaligned
			run through MMU for first part of data
				if MMU miss, fixup
			access cache for that part of data
				if cache miss, do write-back/refill
			run through MMU for second part of data
				if MMU miss, fixup
			access cache for second part of data
				if cache miss, do write-back/refill
		Now, in either case, we now have a longword that has the
		address of the actual data.
	b) Increment r1  [well, this is where you'd LIKE to do it, or
	in parallel with step 2a).]  However, see later why not...
	c) Now, fetch the actual data from memory, using the address just
	obtained, doing everything in step 2a) again, yielding the
	actual data, which we need to stick in a temporary buffer, since it
	doesn't actually go in a register.

3) Now, decode the second operand specifier, which goes thru everything that we did in step 2, only again, and leaves the results in a second temporary buffer. Note that we'd like to be starting this before we get done with all of 2 (and I THINK the VAX9000 probably does that??) but you have to be careful to bypass/interlock on potential side-effects to registers .... actually, you may well have to keep shadow copies of every register that might get written in the instruction, since every operand can use auto-increment/decrement. You'd probably want badly to try to compute the address of the second argument and do the MMU access interleaved with the memory access of the first, although the ability of any operand to need 2-4 MMU accesses probably makes this tricky. [Recall that any MMU access may well cause a page fault....]

4) Now, do the add. [could cause exception]

5) Now, do the third specifier... only, it might be a little different, depending on the nature of the cache, that is, you cannot modify cache or memory, unless you know it will complete. (Why? well, suppose that the location you are storing into overlaps with one of the indirect-addressing words pointed to by r1 or 4(r1), and suppose that the store was unaligned, and suppose that the last byte of the store crossed a page boundary and caused a page fault, and that you'd already written the first 3 bytes. If you did this straightforwardly, and then tried to restart the instruction, it wouldn't do the same thing the second time.)

6) When you're sure all is well, and the store is on its way, then you can safely update the two registers, but you'd better wait until the end, or else, keep copies of any modified registers until you're sure it's safe. (I think both have been done ??)
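
The restart problem with the store can be shown with a toy simulation. This is a minimal Python sketch under loudly stated assumptions: a 16-byte page, a sparse little-endian byte memory, and a made-up instruction `mem[dst] = mem[mem[ptr]] + 1` standing in for the ADDL. None of this models a real VAX; it only demonstrates why a store that writes its first bytes before faulting on the last one makes the restarted instruction compute something different.

```python
PAGE_SIZE = 16  # toy page size, chosen only to keep addresses small

class PageFault(Exception):
    """Raised when an access touches a page that is not resident."""

class ToyMemory:
    def __init__(self, missing_pages):
        self.cells = {}                    # sparse byte store, unwritten bytes read as 0
        self.missing = set(missing_pages)  # pages that fault until paged in

    def _touch(self, addr):
        if addr // PAGE_SIZE in self.missing:
            raise PageFault(addr)

    def load_word(self, addr):
        for i in range(4):
            self._touch(addr + i)
        return int.from_bytes(bytes(self.cells.get(addr + i, 0) for i in range(4)), "little")

    def store_word_naive(self, addr, value):
        # Writes byte by byte: a fault on a later byte leaves earlier bytes modified.
        for i, b in enumerate(value.to_bytes(4, "little")):
            self._touch(addr + i)
            self.cells[addr + i] = b

    def store_word_checked(self, addr, value):
        # Probes every byte first: memory changes only if the whole store can complete.
        for i in range(4):
            self._touch(addr + i)
        for i, b in enumerate(value.to_bytes(4, "little")):
            self.cells[addr + i] = b

def run_until_done(mem, store, ptr_addr, dst_addr):
    """Hypothetical instruction mem[dst] = mem[mem[ptr]] + 1, restarted after each fault."""
    while True:
        try:
            source = mem.load_word(ptr_addr)   # indirect: fetch the operand's address
            result = mem.load_word(source) + 1
            store(mem, dst_addr, result)
            return mem.load_word(dst_addr)
        except PageFault as fault:
            mem.missing.discard(fault.args[0] // PAGE_SIZE)  # the "OS" pages it in

def demo(store):
    mem = ToyMemory(missing_pages={1})
    mem.store_word_checked(12, 32)    # pointer word at bytes 12..15 holds address 32
    mem.store_word_checked(32, 100)   # the operand itself
    # The destination 13..16 overlaps the pointer word and ends in missing page 1.
    return run_until_done(mem, store, ptr_addr=12, dst_addr=13)

naive_result = demo(ToyMemory.store_word_naive)
checked_result = demo(ToyMemory.store_word_checked)
print(naive_result, checked_result)  # 1 101
```

With the naive store, the first attempt clobbers three bytes of the pointer word before faulting, so the restart follows a different pointer and stores 1 instead of 101. The checked store is the "know it will complete before modifying anything" discipline described in step 5.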
You may say that this code is unlikely, but it is legal, so the CPU must do it. This style has the following effects:
1. You have to worry about unlikely cases.
2. You'd like to do the work, with predictable uses of functional units, but instead, they can make unpredictable demands.
3. You'd like to minimize the amount of buffering and state, but it costs you in both to go fast.
4. Simple pipelining is very, very tough: for example, it is pretty hard to do much about the next instruction following the ADDL, (except some early decode, perhaps), without a lot of gates for special-casing. (I've always been amazed that CVAX chips are as fast as they are, and VAX 9000s are REALLY impressive...)
5. EVERY memory operand can potentially cause 4 MMU uses, and hence 4 MMU faults that might actually be page faults...
6. AND there are even worse cases, like the addp6 instruction, that can require *40* pages to be resident to complete...
Consider how "lazy" RISC designers can be:
1. Every load/store uses exactly 1 MMU access.
2. The compilers are often free to re-arrange the order, even across what would have been the next instruction on a CISC. This gets rid of some stalls that the CISC may be stuck with (especially memory accesses).
3. The alignment requirement avoids especially the problem with sending the first part of a store on the way before you're SURE that the second part of it is safe to do.
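
The "4 MMU uses per operand" figure is just two doublings, which a few lines of Python can make explicit. This is a deliberately simplified model (the function and its parameters are my own naming): it reproduces the 4 per VAX operand, the 24 in the VAX row, and the 1 for an aligned load/store machine, but it is only a rough upper bound and does not reproduce every entry in the table (for example, the NSC 32016 row lists 4 even though it has indirect addressing).

```python
def worst_case_mmu_uses(memory_operands, indirect, unaligned):
    """Rough upper bound on data-address translations for one instruction."""
    per_operand = 1
    if indirect:
        per_operand *= 2   # one translation for the pointer, one for the data
    if unaligned:
        per_operand *= 2   # each of those accesses may straddle a page boundary
    return memory_operands * per_operand

vax_per_operand = worst_case_mmu_uses(1, indirect=True, unaligned=True)
vax_instruction = worst_case_mmu_uses(6, indirect=True, unaligned=True)
risc_load_store = worst_case_mmu_uses(1, indirect=False, unaligned=False)
print(vax_per_operand, vax_instruction, risc_load_store)  # 4 24 1
```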

Finally, to be fair, let me add the two cases that I knew of that were more on the borderline: i960 and Clipper:

CPU	Age	3a 3b 3c 3d	4a 4b 5a 5b	6a 6b	# ODD
RULE	<6	=1 =4 <5 =0	=0 =1 <2 =1	>4 >3
-------------------------------------------------------------------------
J1	5	 4+ 8+ 9+ 0	 0  1  0  2	 4+ 3+	5	Clipper
K1	3	 2+ 8+ 9+ 0	 0  1  2+ -	 5  3+	5	Intel 960KB

(I think an ARM would be in this area as well; I think somebody once sent me an ARM-entry, but I can't find it again; sorry.)

Note: slight modification (I'll integrate this sometime):

From j...@MIT.EDU  Mon Nov 29 12:59:55 1993
Subject: Re: Why are Motorola's slower than Intel's ? [really what's a RISC]
Newsgroups: comp.arch
Organization: Massachusetts Institute of Technology

Since you made your table IBM has released a couple chips that support
unaligned accesses in hardware even across cache line boundaries and
may store part of an unaligned object before taking a page fault on
the second half, if the object crosses a page boundary.

These are the RSC (single chip POWER) and PPC 601 (based on RSC core).
    John Carr (j...@mit.edu)

(Back to me; jfc's comments are right; if I had time, I'd add another line to do PPC ... which, in some sense replays the S/360 -> S/370 history of relaxing alignment restrictions somewhat. I conjecture that at least some of this was done to help Apple s/w migration.)

SUMMARY:

RISCs share certain architectural characteristics, although there are differences, and some of those differences matter a lot.
However, the RISCs, as a group, are much more alike than the CISCs as a group.
At least some of these architectural characteristics have fairly serious consequences on the pipelinability of the ISA, especially in a virtual-memory, cached environment.
Counting instructions turns out to be fairly irrelevant:
1. It's HARD to actually count instructions in a meaningful way... (if you disagree, I'll claim that the VAX is RISCier than any RISC, at least for part of its instruction set :-) Why: VAX has a MOV opcode, whereas RISCs usually have a whole set of opcodes for {LOAD/STORE} {BYTE, HALF, WORD}
2. More instructions aren't what REALLY hurts you, anywhere near as much as features that are hard to pipeline:
3. RISCs can perfectly well have string-support, or decimal arithmetic support, or graphics transforms... or lots of strange register-register transforms, and it won't cause problems... but compare that with the consequence of adding a single instruction that has 2-3 memory operands, each of which can go indirect, with auto-increments, and unaligned data...

PART II - ADDRESSING MODES

I promised to repost this with fixes, and people have been asking for it, so here it is again: if you saw it before, all that's really different is some fixes in the table, and a few clarified explanations:

THE GIANT ADDRESSING MODE TABLE (Corrections happily accepted) This table goes with the higher-level table of general architecture characteristics.

Address mode summary
r	register
r+	autoincrement (post)	[by size of data object]
-r	autodecrement (pre)	[by size,...and this was the one I meant]
>r	modify base register	[generally, effective address -> base]
				NOTE: sometimes this subsumes r+, -r, etc,
				and is more general, so I categorize it
				as a separate case.

d	displacement		d1 & d2 if 2 different displacements
x	index register
s	scaled index
a	absolute	[as a separate mode, as opposed to displacement+(0)]
I	Indirect

Shown below are 22 distinct addressing modes [you can argue whether these are right categories]. In the table are the *number* of different encodings/variations [and this is a little fuzzy; you can especially argue about the 4 in the HP PA column, I'm not even sure that's right]. For example, I counted as different variants on a mode the case where the structure was the same, but there were different-sized displacements that had to be decoded. Note that meaningfully counting addressing modes is *at least as bad* as meaningfully counting opcodes; I did the best I could, and I spent a lot of hours looking at manuals for the chips I hadn't programmed much, and in some cases, even after hours, it was hard for me to figure out meaningful numbers... *Most* of these architectures are used in general-purpose systems and *most* have at least one version that uses caches: those are important because many of the issues in thinking about addressing modes come from their interactions with MMUs and caches...

        1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20  21  22
                                                                     r   r
                                                           r  r  r   +d1 +d1
                    r  r  r |              |   r  r |   r  r+ +d +d1 I   +s
           r  r  r  +d +x +s|         s+ s+|s+ +d +d|r+ +d I  I  I   +s  I
        r  +d +x +s >r >r >r|r+ -r a  a  r+|-r +x +s|I  I  +s +s +d2 +d2 +d2
        -- -- -- -- -- -- --|-- -- -- -- --|-- -- --|-- -- -- -- --- --- ---
AMD 29K  1                  |              |        |
Rxxx        1               |              |        |
SPARC       1  1            |              |        |
88K         1  1  1         |              |        |
HP PA       2  1  1  4  1  1|              |        |
ROMP     1  2               |              |        |
POWER       1  1     1  1   |              |        |
i860        1  1     1  1   |              |        |
Swrdfish 1  1  1            |       1      |        |
ARM      2  2     2  1     1| 1  1         |        |
Clipper  1  3  1            | 1  1  2      |        |
i960KB   1  1  1  1         |       2  2   |    1   |

S/360       1               |              |    1   |
i486     1  3  1  1         | 1  1  2      |    2  3|
NSC32K      3               | 1  1  3  3   |       3|             9
MC68000  1  1               | 1  1  2      |    2   |
MC68020  1  1               | 1  1  2      |    2  4|                 16  16
VAX      1  3     1         | 1  1  1  1  1| 1     3| 1  3  1  3

COLUMN NOTES:

Columns 1-7 are addressing modes used by many machines, but very few, if any clearly-RISC architectures use anything else. They are all characterized by what they don't have:
- 2 adds needed before generating the address
- indirect addressing
- variable-sized decoding
Columns 13-15 include fairly simple-looking addressing modes, which however, *may* require 2 back-to-back adds before the address is available. [*may* because some of them use index-register=0 or something to avoid indexing, and usually in such machines, you'll see variable timing figures, depending on use of indexing.]
Columns 16-22 use indirect addressing.
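
As a rough illustration of the three column groups, the following Python sketch computes an effective address for one mode from each group and records how many adds and memory reads are needed before the address is available. The register names, the `Cost` record, and the function names are hypothetical; the scaling of the index is treated as a shift rather than an add.

```python
from dataclasses import dataclass

@dataclass
class Cost:
    address: int       # resulting effective address
    adds: int          # additions needed before the address is available
    memory_reads: int  # memory reads needed before the address is available

def base_plus_disp(regs, r, d):
    """Column 2 (r+d): one add, no memory access."""
    return Cost(regs[r] + d, adds=1, memory_reads=0)

def base_plus_disp_plus_scaled(regs, r, d, x, scale):
    """Column 15 (r+d+s): two back-to-back adds before the address exists."""
    return Cost(regs[r] + d + regs[x] * scale, adds=2, memory_reads=0)

def disp_indirect(regs, mem, r, d):
    """Column 17 (r+d, then indirect): a memory read just to learn the address."""
    pointer_location = regs[r] + d
    return Cost(mem[pointer_location], adds=1, memory_reads=1)

regs = {"r1": 0x1000, "r2": 3}
mem = {0x1008: 0x2000}          # word at 0x1008 holds a pointer to 0x2000

print(base_plus_disp(regs, "r1", 8))
print(base_plus_disp_plus_scaled(regs, "r1", 8, "r2", 4))
print(disp_indirect(regs, mem, "r1", 8))
```

The indirect case is the expensive one for a pipeline: the memory read can miss in the cache or the MMU before the real operand access has even started.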

ROW NOTES

Clipper & i960, of current chips, are more on the RISC-CISC border, or are sort of "modern CISCs". ARM is also characterized that way (by ARM people, Hot Chips IV: "ARM is not a 'pure RISC'").
ROMP has a number of characteristics different from the rest of the RISCs, you might call it "early RISC", and it is of course no longer made.
You might consider HP PA a little odd, as it appears to have more addressing modes, in the same way that CISCs do, but I don't think this is the case: it's an issue of whether you call something several modes or one mode with a modifier, just as there is trouble counting opcodes (with & without modifiers). From my view, neither PA nor POWER have truly "CISCy" addressing modes.
Notice difference between 68000 and 68020 (and later 68Ks): a bunch of incredibly-general & complex modes got added...
Note that the addressing on the S/360 is actually pretty simple, mostly base+displacement, although RX-addressing does take 2 regs+offset.
A dimension *not* shown on this particular chart, but also highly relevant, is that this chart shows the different *types* of modes, *not* how many addresses can be found in each instruction. That may be worth noting also:
```
	AMD - i960	1	one address per instruction
	S/360 - MC68020	2	up to 2 addresses
	VAX		6	up to 6
```

By looking at alignment, indirect addressing, and looking only at those chips that have MMUs, consider the number of times an MMU *might* be used per instruction for data address translations:

	AMD - Clipper	2		[Swordfish & i960KB: no TLB]
	S/360 - NSC32K	4
	MC68Ks (all)	8
	VAX		24

When RS/6000 does unaligned, it must be in the same cache line (and thus also in same MMU page), and traps to software otherwise, thus avoiding numerous ugly cases.
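
The "same cache line, thus same MMU page" step follows from the page size being a multiple of the line size. Here is a small Python sketch of that check; the 64-byte line and 4096-byte page are assumed values for illustration only, not RS/6000 parameters.

```python
LINE_SIZE = 64     # assumed cache-line size in bytes (illustrative)
PAGE_SIZE = 4096   # assumed page size in bytes (illustrative)

def crosses(addr, size, boundary):
    """True if the access [addr, addr+size) spans a multiple of `boundary`."""
    return addr // boundary != (addr + size - 1) // boundary

def handled_in_hardware(addr, size):
    """Policy sketched above: an unaligned access is fine only within one cache line."""
    return not crosses(addr, size, LINE_SIZE)

# An access that stays in one line can never need two page translations.
same_line_same_page = all(
    not crosses(addr, 4, PAGE_SIZE)
    for addr in range(2 * PAGE_SIZE)
    if handled_in_hardware(addr, 4)
)
print(same_line_same_page)                   # True
print(handled_in_hardware(LINE_SIZE - 1, 4)) # False: straddles two lines, trap
```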

Note: in some sense, S/360s & VAXen can use an arbitrary number of translations per instruction, with MOVE CHARACTER LONG, or similar operations & I don't count them as more, because they're defined to be interruptable/restartable, saving state in general-purpose registers, rather than hidden internal state.

SUMMARY:

Computer design styles mostly changed from machines with:
- 2-6 addresses per instruction, with variable sized encoding
- address specifiers were usually "orthogonal", so that any could go anywhere in an instruction
- sometimes indirect addressing
- sometimes need 2 adds *before* effective address is available
- sometimes with many potential MMU accesses (and possible exceptions) per instruction, often buried in the middle of the instruction, and often *after* you'd normally want to commit state because of auto-increment or other side effects.
to machines with:
- 1 address per instruction
- address specifiers encoded in small # of bits in 32-bit instruction
- no indirect addressing
- never need 2 adds before address available
- use MMU once per data access
and we usually call the latter group RISCs. I say "changed" because if you put this table together with the earlier one, which has the age in years, the older ones were one way, and the newer ones are different.
Now, ignoring any other features, but looking at this single attribute (architectural addressing features and implementation effects thereof), it ought to be clear that the machines in the first part of the table are doing something *technically* different from those in the second part of the table. Thus, people may sometimes call something RISC that isn't, for marketing reasons, but the people calling the first batch RISC really did have some serious technical issues at heart.
One more time: this is *not* to say that RISC is better than CISC, or that the few in the middle are bad, or anything like that ... but that there are clear technical characteristics...

PART III - MORE ON TERMINOLOGY; WOULD YOU CALL THE CDC 6600 A RISC?

In article <2nii0d$k...@crl2.crl.com>, dben...@crl.com (Andrea Chen) writes:

|> You may be correct on the creation of the term, but RISC does
|> refer to a school of computer design that dates back to the
|> early seventies.

This is all getting fairly fuzzy and subjective, but it seems very confusing to label RISC as a school of thought that dates back to the early 1970s.

One can say that RISC is a school of thought that got popular in the early-to-mid 80's, and got widespread commercial use then.
One can say that there were a few people (like John Cocke & co at IBM) who were doing RISC-style research projects in the mid-70s.
But if you want to go back, as has been discussed in this newsgroup often, a lot of people go back to the CDC 6600, whose design started in 1960, and was delivered in 4Q 1964. Now, while this wouldn't exactly fit the exact parameters of current RISCs, a great deal of the RISC-style approach was there in the central processor ISA:
1. Load/store architecture.
2. 3-address register-register instructions
3. Simply-decoded instruction set
4. Early use of instruction scheduling by the compiler, with the expectation that you'd usually program in a high-level language and not often resort to assembler, as you'd expect the compiler to do well.
5. More registers than common at the time
6. ISA designed to make decode/issue easy
Note that the 360/91 (1967) offered a good example of building a CISC-architecture into a high-performance machine, and was an interesting comparison to the 6600.
Maybe there is some way to claim that RISC goes back to the 1950s, but in general, most machines of the 1950s and 1960s don't feel very RISCy (to me). Consider Burroughs B5000s; IBM 709x, 707x, 1401s; Univac 110x; GE 6xx, etc, and of course, S/360s. Simple load/store architectures were hard to find; there were often exciting instruction decodings required; indirect addressing was popular; machines often had very few accumulators.
If you want to try sticking this in the matrix I've published before, as best as I recall, the 6600 ISA generally looked like:
```
CPU		3a 3b 3c 3d	4a 4b 5a 5b	6a 6b	# ODD
RULE		=1 =4 <5 =0	=0 =1 <2 =1	>4 >3
-------------------------------------------------------------------------
CDC 6600	 2  *  1  0	 0  1  0  1	 3  3	4 (but  ~1 if fair)
```
That is:
2: it has 2 instruction sizes (not 1), 15 & 30 bits (however, they were packed into 60-bit words, so if you had 15, 30, 30, the second 30-bitter would not cross word boundaries, but would start in the second word).
*: 15-and-30 bit instructions, not 32-bit.
1: 1 addressing mode [Note: Tim McCaffrey emailed me that one might consider there to be more, i.e., you could set an address register to combinations of the others to give autoincrement/decrement/index+offset, etc.]. In any case, you compute an address as a simple combination of 1-2 registers, and then use the address, without further side-effects.
0: no indirect addressing
0: no operations that combine load/store with arithmetic (it is a load/store architecture)
1: have one memory operand per instruction
0: do NOT support arbitrary alignment of operands in memory (well, it was a word-addressed machine :-)
1: use an MMU for data translation no more than once per instruction (MMU used loosely here)
3,3: had 3-bit fields for addressing registers, both index and FP
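
The packing rule for the 15- and 30-bit instructions is easy to state in code. This Python sketch only models the rule as described above (an instruction never crosses a 60-bit word boundary); the function name is my own, and what the hardware or assembler actually put in the unused space is not modeled.

```python
WORD_BITS = 60

def pack(instruction_sizes):
    """Assign 15- and 30-bit instructions to 60-bit words without splitting any."""
    words, used = [[]], 0
    for size in instruction_sizes:
        if used + size > WORD_BITS:   # would cross the word boundary
            words.append([])          # remainder of the current word is padding
            used = 0
        words[-1].append(size)
        used += size
    return words

print(pack([15, 30, 30]))       # [[15, 30], [30]]
print(pack([15, 15, 15, 15]))   # [[15, 15, 15, 15]]
```

Because boundaries are always re-established at each 60-bit word, decode never has to chase an instruction across two fetches, which is part of why two sizes cost so little here.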

Now, of the 10 ISA attributes I'd proposed for identifying typical RISCs, the CDC 6600 obeys 6. It varies in having 2 instruction formats, and in having only 3 bits for register fields, but it had simple packing of the instructions into fixed-size words, and register/accumulators were pretty expensive in those days (some popular machines only had one accumulator and a few index registers, so 8 of each was a lot). Put another way: it had about as many registers as you'd conveniently build in a high-speed machine, and while they packed 2-4 operations into a 60-bit word, the decode was pretty straightforward. Anyway, given the caveats, I'd claim that the 6600 would fit much better in the RISC part of the original table...

PART IV - RISC, VLIW, STACKS

In article <35a1a3$m...@doc.armltd.co.uk>, Clive...@armltd.co.uk writes:

Really? The Venerable John Mashey's table appears to contain as many exceptions to the rule about number of GP registers as most others. I'm sure if one were to look at the various less conventional processors, there would be some clearly RISC processors that didn't have a load-store architecture - stack and VLIW processors spring to mind.

I'm not sure I understand the point. One can believe any of several things:

One can believe RISC is some marketing term without technical meaning whatsoever. OR
One can believe that RISC is some collection of implementation ideas. This is the most common confusion.
One can believe that RISC has some ISA meaning (such as RISC == small number of opcodes) ... but have a different idea of RISC than do most chip architects who build them. If you want to pay words extra money every Friday to mean something different than what they mean to practitioners ... then you are free to do so, but you will have difficulty communicating with practitioners if you do so.
EX: I'm not sure how stack architectures are "clearly RISC" (?) Maybe CRISP, sort of. Burroughs B5000 or Tandem's original ISA: if those are defined as RISC, the term has been rendered meaningless.
EX: VLIWs: I don't know any reason why I'd call VLIWs, in general, either clearly RISC or clearly not. VLIW is a technique for issuing instructions to more functional units than you have the die space/cycle time to decode more dynamically. There gets to be a fuzzy line between:
1. A VLIW, especially if it compresses instructions in memory, then expands them out when brought into the cache.
2. A superscalar RISC, which does some predecoding on the way from memory->cache, adding "hint" bits or rearranging what it keeps there, speeding up cache->decode->issue.
At least some VLIWs are load/store architectures, and the operations they do usually look like typical RISC operations. OR, you can believe that:
RISC is a term used to characterize a class of relatively-similar ISAs mostly developed in the 1980s. Thus, if a knowledgeable person looks at ISAs, they will tend to cluster various ISAs as:
1. Obvious RISC, fits the typical rules with few exceptions.
2. Obviously not-RISC, fits the inverse of the RISC rules with relatively few exceptions. Sometimes people call this CISC ... but whereas RISCs, as a group, have relatively similar ISAs, the CISC label is sometimes applied to a widely varying set of ISAs.
3. Hybrid / in-the-middle cases, that either look like CISCy RISCs, or RISCy CISCs. There are a few of these.
  Cases 1-3 may apply to reasonably contemporaneous processors, and make some sense. And then
4. CPUs for which RISC/CISC is probably not a very relevant classification. I.e., one can apply the set of rules I've suggested, and get an exception-count, but it may not mean much in practice, especially when applied to older CPUs created with vastly different constraints than current ones, or embedded processors, or specialized ones. Sometimes an older CPU might have been designed with some similar philosophies (i.e., like CDC 6600 & RISC, sort of) whether or not it happened to fit the rules. Sometimes, die-space constraints may have led to "simple" chips, without making them fit the suggested criteria either. Personally, torturous arguments about whether a 6502, or a PDP-8, or a 360/44 or an XDS Sigma 7, etc, are RISC or CISC ... do not usually lead to great insight. After a while such arguments are counting angels dancing on pinheads ("Ahh, only 10 angels, must be RISC" :-).
  In this belief space, one tends to follow Hennessy & Patterson's comment in E.9 that "In the history of computing, there has never been such widespread agreement on computer architecture." None of this is pejorative of earlier architectures, just the observation that the ISAs newly-developed in the 1980s were far more similar than the earlier groups of ISAs. [I recall a 2-year period in which I used IBM 1401, IBM 7074, IBM 7090, Univac 1108, and S/360, of which only the 7090 and 1108 bore even the remotest resemblance to each other, i.e., at least they both had 36-bit words.]

Summary: RISC is a label most commonly used for a set of ISA characteristics chosen to ease the use of aggressive implementation techniques found in high-performance processors (regardless of RISC, CISC, or irrelevant). This is a convenient shorthand, but that's all, although it probably makes sense to use the term the way it's usually meant by people who do chips for a living.

-- 
-john mashey    DISCLAIMER: 
UUCP:    ma...@sgi.com 
DDD:    415-390-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311
