The CP itself handled only arithmetic and logic operations, while the PPs performed all of the memory and input/output tasks. Since the CP was dealing with a much smaller set of operations, it could be run faster. The other important element was the switch from thermionic valves (vacuum tubes) to transistors, which offered faster switching speeds. These factors taken together meant that the CDC 6600's CPU could run at 10MHz while other supercomputers of the day were operating at around 1MHz.
Since memory at that time was around 10 times faster than most supercomputer CPUs, the CDC 6600's architecture ensured that operations took full advantage of that bandwidth. The CDC 6600's PPs were each allowed access to the CP for one tenth of the time. So, although they were running slower than the CP, they were able to keep data flowing. The CDC 6600's CP also contained 10 functional units internally, which enabled it to work on instructions in parallel. This was the first implementation of a superscalar processor design.
The idea of parallelism has continued to dominate the structure of supercomputers since the CDC 6600. It requires careful programming, mostly because the code has to be split into chunks that can run simultaneously. The next Cray design – the CDC 7600 – introduced pipelining, a technique where a functional unit is broken up into stages so that it can begin work on a new instruction before it has finished the previous one.
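To give a flavour of what that chunking looks like today (a modern illustration rather than anything the CDC or Cray machines ran), here is a minimal sketch in C using OpenMP, where a loop is divided across threads and each thread works on its own slice of the data:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double data[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        data[i] = i * 0.5;

    /* The loop is split into chunks, one per thread; each thread sums
       its own slice and the partial results are combined at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];

    printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}
```

Built with an OpenMP-aware compiler (for example `gcc -fopenmp`), the runtime decides how many chunks to create based on the cores available.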
Superscalar designs with multistage pipelines are now de rigueur in modern desktop processors. VIA's C7 and Intel's Atom are notable non-superscalar exceptions.
Following the vector
The next CDC development introduced another important element that has defined the expansion of supercomputers ever since: vector processing. This technique sees a single operation being performed on multiple data elements at once.
The first system from Seymour Cray's own company – the Cray-1 – used vector processing with the addition of vector registers. These registers allowed it to apply multiple operations to the same data at once, and necessitated separate vector hardware – something that has been added to desktop CPUs in the form of secondary Single Instruction, Multiple Data (SIMD) logic over the last decade.
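As a rough modern analogue of that idea, the sketch below uses x86 SSE intrinsics in C – desktop SIMD rather than the Cray-1's vector hardware – to show a single instruction operating on four floating-point values at once, where a scalar loop would need four separate additions:

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    /* Each pass issues one vector add that works on four floats at
       once - a single instruction applied to multiple data elements. */
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);
        _mm_storeu_ps(&c[i], vc);
    }

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}
```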
Vector processing has remained the core structure of supercomputer CPUs. The only major additions have been multiprocessing and clustering, which are different levels of essentially the same thing. Multiprocessing groups multiple CPUs into a single computer (also known as a 'node'), while clustering groups together multiple nodes.
These multiprocessor computers can work on multiple streams of data using their vector subsystems, so they are called Multiple Instruction, Multiple Data (MIMD) systems. So while different supercomputer companies put varying numbers and types of CPUs in each node and use a varying number of nodes in their clusters, the overall approach is almost universal.
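In practice, clusters are usually programmed with a message-passing library such as MPI – not something the machines above used, but a convenient way to show the MIMD idea. In the minimal C sketch below (assuming an MPI installation is available), each process runs its own instruction stream on its own slice of the data, potentially on a different node, and the partial results are combined at the end:

```c
#include <stdio.h>
#include <mpi.h>

#define N 1000000

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process is this? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many processes in total? */

    /* Each process - its own instruction stream - works on its own
       share of the data, so different nodes proceed independently. */
    double local = 0.0;
    for (int i = rank; i < N; i += size)
        local += i * 0.5;

    /* Combine the partial results on process 0. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %f across %d processes\n", total, size);

    MPI_Finalize();
    return 0;
}
```

Launched with something like `mpirun -np 4 ./program`, the four processes are spread across whatever CPUs or nodes the cluster makes available.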
The upshot of this is that the CPU design itself is no longer the focus of attention. Instead, manufacturers concentrate on how the CPUs are connected together. For example, non-uniform memory access (NUMA) has become a mainstay in supercomputing, particularly with processor designs that include on-die memory controllers.
In the first few decades of supercomputing, memory was faster than processors, which was one of the main reasons behind the new design created for the CDC 6600. Nowadays, however, CPUs are faster than memory, and this is even more of a problem when memory is shared across lots of processors.
NUMA alleviates this problem by giving each processor its own local memory. But rather than keeping this memory entirely separate, processors can also access each other's local memory. The memory and cache controllers associated with each processor must also communicate to maintain coherency. Otherwise, changes to data held locally would not be recognised when the same data is worked on by another processor. Fast connectivity between processors is therefore a necessity.
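On Linux systems, this placement is exposed to programmers through the libnuma library. The short C sketch below (node numbers and allocation policies will vary from machine to machine) allocates a buffer on a specific node, so that threads running on that node's CPUs get local rather than remote memory accesses:

```c
#include <stdio.h>
#include <numa.h>        /* libnuma; link with -lnuma */

#define BUF_SIZE (64 * 1024 * 1024)

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    printf("System has %d NUMA node(s)\n", numa_max_node() + 1);

    /* Allocate the buffer from node 0's local memory. CPUs on node 0
       see fast local accesses; CPUs on other nodes reach the same data
       over the processor interconnect. */
    double *buf = numa_alloc_onnode(BUF_SIZE, 0);
    if (buf == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    for (size_t i = 0; i < BUF_SIZE / sizeof(double); i++)
        buf[i] = (double)i;

    numa_free(buf, BUF_SIZE);
    return 0;
}
```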