There are currently four generations of CPU core that implement the Alpha architecture (EV4, EV45, EV5 and EV56):
Opinions differ as to what ``EV'' stands for (Editor's note: the true
answer is of course ``Electro Vlassic''; see Section C.12), but the
number represents the first generation of Digital's CMOS technology
that the core was implemented in. So, the EV4 was originally
implemented in CMOS4. As time goes by, a CPU tends to get a mid-life
performance kick by being optically shrunk into the next generation of
CMOS process. EV45, then, is the EV4 core implemented in CMOS5
process. There is a big difference between shrinking a design into a
particular technology and implementing it from scratch in that
technology (but I don't want to go into that now). There are a few
other wildcards in here: there is also a CMOS4S (optical shrink in
CMOS4) and a CMOS5L.
True technophiles will be interested to know that CMOS4 is a 0.75 micron
process, CMOS5 is a 0.5 micron process and CMOS6 is a 0.35 micron process.
To map these CPU cores to chips we get:
The EV4 core is a dual-issue (it can issue 2 instructions per CPU
clock) superpipelined core with integer unit, floating point unit and
branch prediction. It is fully bypassed and has 64-bit internal data
paths and tightly coupled 8Kbyte caches, one each for Instruction and
Data. The caches are write-through (they never get dirty).
The EV45 core has a couple of tweaks to the EV4 core: it has a
slightly improved floating point unit, and 16KB caches, one each for
Instruction and Data (it also has cache parity). (Editor's note: Neal
Crook indicated in a separate mail that the changes to the floating
point unit (FPU) improve the performance of the divider. The EV4 FPU
divider takes 34 cycles for a single-precision divide and 63 cycles
for a double-precision divide (non data-dependent). In contrast, the
EV45 divider takes typically 19 cycles (34 cycles max) for
single-precision and typically 29 cycles (63 cycles max) for a
double-precision division (data-dependent).)
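From the cycle counts quoted above, the rough speedup of the EV45 divider over the EV4 can be worked out directly (using the EV45's typical counts, since its latency is data-dependent):

```python
# Rough speedup of the EV45 FP divider over the EV4, using the cycle
# counts quoted in the text. EV4 latencies are fixed; the EV45 figures
# are typical values (its divider latency is data-dependent).
ev4 = {"single": 34, "double": 63}
ev45_typical = {"single": 19, "double": 29}

for prec in ("single", "double"):
    speedup = ev4[prec] / ev45_typical[prec]
    print(f"{prec}: {ev4[prec]} -> {ev45_typical[prec]} cycles, "
          f"~{speedup:.1f}x faster (typical)")
```

So a typical double-precision divide is roughly twice as fast on the EV45, with worst-case latency unchanged.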
The EV5 core is a quad-issue core, also superpipelined, fully bypassed
etc etc. It has tightly-coupled 8Kbyte caches, one each for I and D. These
caches are write-through. It also has a tightly-coupled 96Kbyte on-chip
second-level cache (the Scache) which is 3-way set associative and write-back
(it can be dirty). The EV4->EV5 performance increase is better than just
the increase achieved by clock speed improvements. As well as the bigger
caches and quad issue, there are microarchitectural improvements to reduce
producer/consumer latencies in some paths.
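The write-through/write-back distinction mentioned for the L1 caches and the Scache can be sketched in a few lines (a generic illustration, not Alpha-specific):

```python
# Minimal sketch (not Alpha-specific) of the two write policies above:
# a write-through cache forwards every store to the next level, so its
# lines are never dirty; a write-back cache marks the line dirty and
# defers the update until the line is evicted.

class WriteThroughCache:
    def __init__(self, backing):
        self.lines = {}
        self.backing = backing          # next level (e.g. the Scache)

    def store(self, addr, value):
        self.lines[addr] = value
        self.backing[addr] = value      # every store goes straight through

class WriteBackCache:
    def __init__(self, backing):
        self.lines = {}                 # addr -> (value, dirty_flag)
        self.backing = backing

    def store(self, addr, value):
        self.lines[addr] = (value, True)   # line becomes dirty

    def evict(self, addr):
        value, dirty = self.lines.pop(addr)
        if dirty:
            self.backing[addr] = value  # written back only on eviction

mem = {}
wt = WriteThroughCache(mem)
wt.store(0x100, 42)      # memory sees the store immediately
wb = WriteBackCache(mem)
wb.store(0x200, 7)       # memory not yet updated; line is dirty
wb.evict(0x200)          # dirty line written back on eviction
```

A write-through cache never needs to write data back on eviction, which simplifies the design; the write-back Scache trades that simplicity for less traffic to the next level.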
The EV56 core is fundamentally the same microarchitecture as the
EV5, but it adds some new instructions for 8 and 16-bit loads and
stores (see Section C.8, Bytes and all that stuff).
These are primarily intended for use by device drivers. The
EV56 core is implemented in CMOS6, which is a 2.0V process.
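To see why these instructions matter: before EV56, storing a single byte meant a read-modify-write of the containing 64-bit quadword (roughly the LDQ_U/INSBL/MSKBL/STQ_U sequence); with the new instructions it is a single store. A rough model of the old sequence, with a Python int standing in for a quadword in memory:

```python
# Sketch of the pre-EV56 byte store: merge one byte into a 64-bit
# quadword via read-modify-write (roughly what the old
# LDQ_U/INSBL/MSKBL/STQ_U sequence did in hardware registers).

def store_byte_rmw(quadword, byte_offset, value):
    """Return the quadword with byte `byte_offset` replaced by `value`."""
    shift = byte_offset * 8
    cleared = quadword & ~(0xFF << shift)        # clear the target byte
    return cleared | ((value & 0xFF) << shift)   # insert the new byte

q = 0x1122334455667788
q = store_byte_rmw(q, 2, 0xAB)   # replace byte 2 (bits 16-23)
print(hex(q))                    # -> 0x1122334455ab7788
```

The read-modify-write is particularly awkward for device drivers talking to memory-mapped byte-wide registers, where reading back and rewriting adjacent bytes may be unsafe; hence the note that the new instructions are primarily intended for drivers.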
The 21064 was announced in March 1992. It uses the EV4 core, with a 128-bit
bus interface. The bus interface supports the 'easy' connection of an external
second-level cache, with a block size of 256-bits (2 data beats on the
bus). The Bcache timing is completely software configurable. The 21064 can also
be configured to use a 64-bit external bus (but I'm not sure if any shipping
system uses this mode). The 21064 does not impose any policy on the Bcache, but
it is usually configured as a write-back cache. The 21064 does contain hooks to
allow external hardware to maintain cache coherence with the Bcache and
internal caches, but this is hairy.
The 21066 uses the EV4 core and integrates a memory controller and
PCI host bridge. To save pins, the memory controller has a 64-bit data
bus (but the internal caches have a block size of 256 bits, just like
the 21064, therefore a block fill takes 4 beats on the bus). The
memory controller supports an external Bcache and external DRAMs. The
timing of the Bcache and DRAMs is completely software configurable,
and can be controlled to the resolution of the CPU clock
period. Having a 4-beat process to fill a cache block isn't as bad as
it sounds because the DRAM access is done in page mode. Unfortunately,
the memory controller doesn't support any of the new esoteric DRAMs
(SDRAM, EDO or BEDO) or synchronous cache RAMs. The PCI bus interface
is fully rev2.0 compliant and runs at up to 33MHz.
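The beat counts above follow directly from the bus widths: with a 256-bit internal cache block, a block fill takes as many beats as it takes bus transfers to move 256 bits.

```python
# Beats needed to fill a 256-bit cache block over each bus width
# mentioned above (one beat = one data transfer on the bus).
block_bits = 256
for chip, bus_bits in [("21064 (128-bit bus)", 128),
                       ("21066 (64-bit bus)", 64)]:
    beats = block_bits // bus_bits
    print(f"{chip}: {beats} beats per block fill")
```

This is the 2-beat fill on the 21064 versus the 4-beat fill on the 21066; as noted above, page-mode DRAM access keeps the extra beats on the 21066 from being as costly as they first appear.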
The 21164 has a 128-bit data bus and supports split reads, with
up to 2 reads outstanding at any time (this allows 100% bus
utilisation under best-case dream-on conditions, i.e., you can
theoretically transfer 128 bits of data on every bus clock). The 21164
supports easy connection of an external third-level cache (Bcache) and
has all the hooks to allow external systems to maintain full cache
coherence with all caches. Therefore, symmetric multiprocessor designs
are 'easy'.
The 21164A was announced in October, 1995. It uses the EV56 core. It is
nominally pin-compatible with the 21164, but requires split power rails: all
of the power pins that were +3.3V power on the 21164 have now been split into
two groups; one group provides 2.0V power to the CPU core, and the other group
supplies 3.3V to the I/O cells. Unlike older implementations, the 21164A pins
are not 5V-tolerant. The end result of this change is that 21164 systems are,
in general, not upgradeable to the 21164A (though note that it would be
relatively straightforward to design a 21164A system that could also
accommodate a 21164). The 21164A also has a couple of new pins to support
the new 8 and 16-bit loads and stores. It also improves the 21164 support for
using synchronous SRAMs to implement the external Bcache.