Types of Parallel Computers
There are many types of computers available today, from
single processor or 'scalar' computers to machines with vector
processors to massively parallel computers with thousands of processors.
Each platform has its own unique characteristics. Understanding
the differences is important to understanding how best to program each.
However, the real trick is to try to write programs that will run
reasonably well on a wide range of computers.
Your typical PC or workstation is a scalar, or single-processor,
computer. These come with many different processors and run a wide variety
of operating systems. Below is a list of some current (Summer 2003)
processors and the operating systems each can run. Linux is now available
for almost all processors.
Intel Pentiums      | --> 3.0 GHz
AMD Opteron         | --> 1.8 GHz
IBM Power4          | --> 1.7 GHz
Sun UltraSparcIII   | --> 1.2 GHz
SGI R16000          | --> 700 MHz
Motorola G5         | --> 2.0 GHz  | Mac OS X
Alpha 21264         | --> 1.26 GHz | Tru64 Unix
Parallel vector processors
The great supercomputers of the past were built around custom
vector processors. These are the expensive, high performance
masterpieces pioneered by Seymour Cray.
There are currently only a few examples of computers in production
that still use vector processors, and all are parallel vector processors
(PVPs) that run small numbers of vector processors within the same machine.
NEC Earth Simulator  | 8 GFlops / 500 MHz proc | --> 41 TFlops peak
Cray SV1             | 8 GFlops / vector proc  | --> 64 GFlops/cabinet
NEC SX-6 (Cray SX-6) | 8 GFlops / proc
Cray SV2             | --> 10's of TFlops
Vector processors operate on large vectors of data at the same time.
When possible, the compiler automatically vectorizes the innermost loops,
breaking the work into blocks, often 64 elements in size.
The functional units are pipelined to operate on an entire block,
producing results every clock cycle once the pipeline is full, and the
memory subsystem is optimized to keep the processors fed at this rate.
Since the compilers do much of the optimization automatically, the user
only needs to attack those problem areas where there is some impediment
to the compiler understanding how to vectorize the loop.
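As a rough, hypothetical illustration (the loops are invented, not taken
from any particular compiler's documentation), the first loop below is the
kind a vectorizing compiler handles automatically, while the second contains
the kind of impediment, a loop-carried dependence, that the programmer must
deal with by hand:

    /* Vectorizable: every iteration is independent, so the compiler can
     * break the loop into blocks (often 64 elements) and stream them
     * through the pipelined functional units. */
    void scale_add(int n, double s, double *a, const double *b, const double *c)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] = s * b[i] + c[i];
    }

    /* Not vectorizable as written: a[i] depends on a[i-1] from the
     * previous iteration, so the elements cannot be computed as one
     * vector operation without restructuring the algorithm. */
    void recurrence(int n, double s, double *a, const double *c)
    {
        int i;
        for (i = 1; i < n; i++)
            a[i] = s * a[i-1] + c[i];
    }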
MPI, compiler directives, OpenMP, and pThreads packages can
be used to parallelize a program to run across multiple vector processors.
Shared memory multiprocessors (SMPs)
SMP systems have
more than one scalar processor that share the same memory and memory bus.
This category includes everything from a dual-processor Intel PC
to a 256 processor Origin3000.
Each processor may have its own cache on a dedicated bus, but
all processors are connected to a common memory bus and memory.
In a well designed system, the memory
bus must be fast enough to keep data flowing to all the processors.
Large data caches are also necessary, as they allow each processor
to pull data into 'local' memory to crunch the data while other
processors use the memory bus. Most current SMP systems share these
two design criteria. Early Pentium SMP systems, and the multiprocessor
nodes of the Intel Paragon, did not, and therefore the additional processors
were relatively useless. Below are some other SMP systems, and the
number of processors that each system can have.
SGI Origin3000  | --> 256 MIPS R12000 procs
                | --> 128 MIPS R10000 procs
                | 2-4 MIPS R10000 procs
Compaq ES40     | 2-4 Alpha 21264 procs
                | 2 Alpha 21264 procs      | Tru64 or Linux
IBM 43p         | 1-2 Power3 procs
                | 2-4 Power3 procs
                | --> 32 Power4 procs      | AIX or Linux
Intel Pentium   | 2-4 Xeon procs
                | 1-2 Athlon procs
                | 1-2 Opteron procs
                | 1-2 Itanium2 procs       | MS Windows
SMP systems can be programmed using several different methods.
A multithreading approach can be used where a single program is
run on the system. The program divides the work across the processors
by spawning multiple light-weight threads, each executing on a different
processor and performing part of the calculation.
Since all threads share the same program space, there is no need for
any explicit communication calls.
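As a minimal sketch of this approach (the array, thread count, and work
split are all invented for illustration), POSIX threads can be used to
divide a simple dot product across the processors of an SMP machine:

    #include <pthread.h>
    #include <stdio.h>

    #define N        1000000
    #define NTHREADS 4

    static double a[N], b[N], sum[NTHREADS];

    /* Each thread works on its own slice of the arrays; no explicit
     * communication is needed because all threads share one address space. */
    static void *partial_dot(void *arg)
    {
        long id = (long)arg;
        long lo = id * (N / NTHREADS);
        long hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
        double s = 0.0;
        long i;

        for (i = lo; i < hi; i++)
            s += a[i] * b[i];
        sum[id] = s;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        double total = 0.0;
        long i, t;

        for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        for (t = 0; t < NTHREADS; t++)        /* spawn the threads */
            pthread_create(&tid[t], NULL, partial_dot, (void *)t);

        for (t = 0; t < NTHREADS; t++) {      /* wait and combine  */
            pthread_join(tid[t], NULL);
            total += sum[t];
        }
        printf("dot product = %f\n", total);
        return 0;
    }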
Compilers are sophisticated enough to use multithreading to
automatically parallelize some codes. Unfortunately, this is not
the case very often. While using multithreading may be the
most efficient way to program SMP systems for some applications, it is not
easy to do manually. Plus there are many multithreading packages
to choose from. OpenMP may be the most
popular standard at the moment, but there are many vendor specific
packages and other standards such as the
POSIX pThreads package.
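For comparison, here is a sketch of the same kind of loop written with an
OpenMP directive (again a made-up example; the compiler and runtime decide
how the iterations are split across processors when the code is built with
OpenMP support):

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N];
        double total = 0.0;
        long i;

        for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        /* One directive asks the compiler to run the iterations on multiple
         * threads and combine the per-thread partial sums. */
        #pragma omp parallel for reduction(+:total)
        for (i = 0; i < N; i++)
            total += a[i] * b[i];

        printf("dot product = %f\n", total);
        return 0;
    }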
In summary, multithreading may produce the most efficient code
for SMP systems, but it may not be the easiest way to parallelize a code.
It also may not be portable, even across SMP systems, unless a standard
like OpenMP is chosen that is supported almost everywhere. Of even
more concern is that a multithreaded code will not run on distributed
memory systems.
In the message-passing paradigm, each processor is treated as a
separate computer running a copy of the same program. Each processor
operates on a different part of the problem, and data exchange is
handled by passing messages. More details about this approach will
be presented in the next section.
While not as efficient as multithreading, message passing produces code
that is much more portable, since it is the predominant method
for programming distributed memory systems. Message-passing implementations
on SMP systems are very efficient since the data doesn't have to
traverse the network as in distributed memory systems. It is just copied
from memory to memory, which should occur at high speed and with little overhead.
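To give a flavor of the paradigm before the next section covers it in
detail, here is a minimal MPI sketch (the 'work' is just a placeholder):
each process is a separate copy of the program working on its own block of
the index range, and an explicit communication call combines the partial
results.

    #include <stdio.h>
    #include <mpi.h>

    #define N 1000000

    int main(int argc, char *argv[])
    {
        int rank, nprocs;
        long lo, hi, i;
        double part = 0.0, total = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which copy am I?       */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs); /* how many copies total? */

        lo = rank * (N / nprocs);               /* this process's block   */
        hi = (rank == nprocs - 1) ? N : lo + N / nprocs;
        for (i = lo; i < hi; i++)
            part += 1.0;                        /* stand-in for real work */

        /* Explicit message passing combines the partial results on rank 0. */
        MPI_Reduce(&part, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %f\n", total);

        MPI_Finalize();
        return 0;
    }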
Distributed memory MPPs
There is an even wider variety among the largest systems, the
distributed memory MPPs (massively parallel processors). However, these
systems all share a few common traits. They are made of many individual
nodes, each of which is essentially an independent computer in itself;
indeed, in the case of workstation/PC clusters, each node literally is a
separate computer.
Each node consists of at least one processor, its own memory, and a link to
the network that connects all nodes together. Aside from these generalizations,
distributed memory systems may look very different.
Traditional MPP systems may contain
hundreds or thousands of individual nodes. They have custom networks
that are typically very fast with relatively low latencies.
The network topologies vary widely, from completely connected systems
to 2D and 3D meshes and fat trees. Below is a partial listing of some
of the more common MPP systems available today, and some of the characteristics
of each.
Cray X1        | --> ? nodes    |                       | ? MB/sec   | ? µs
Cray T3E       | --> 1024 nodes | Alpha 21164, 3D torus | 340 MB/sec | 2-3 µs
IBM SP         | --> 6656 nodes | Colony Switch         | 340 MB/sec (off-node), 540 MB/sec (on-node) | 20 µs (off-node), 10 µs (on-node)
Intel Paragon  | --> 1836 nodes | i860 proc, 2D mesh    | 130 MB/sec | 100 µs
There have been many other large MPP systems over the last decade.
While some of the systems may remain in use, the companies that built them
have gone out of business.
These include machines such as the Thinking Machines CM-5, the Kendall
Square systems, the nCube 2, and the Intel Paragon listed above.
These systems soared in popularity, then crashed just as fast. The
reasons for this vary, from the difficulty of surviving the high cost
of using custom components (especially processors) to having good
technology but a poor business model.
MPP systems are programmed using message-passing libraries. Most
have their own custom libraries, but all current systems also support
MPI, the industry-standard message-passing interface. This allows
codes programmed in MPI to be portable across distributed memory
systems, and also SMPs as described previously. Unfortunately, the
MPI implementation is not always as efficient as the native communication
library, so there is still some temptation to use the native library
at the expense of program portability. The Cray T3E is an extreme
case, where the MPI implementation only achieves 160 MB/sec while
the native SHMEM library can deliver twice that rate.
Distributed memory computers can also be built from scratch using
mass produced PCs and workstations. These cluster computers are
referred to by many names, from a poor man's supercomputer to
COWs (clusters of workstations) and NOWs (networks of workstations).
They are much cheaper than traditional MPP systems, and often use the
same processors, but are more difficult to use since the network capabilities
are currently much lower. Cluster computers are also usually much
smaller, most often involving fewer than 100 computers. This is in
part because the networking and software infrastructure for cluster
computing is less mature, making it difficult to make use of very
large systems at this time. Below is a list of some local clusters and
their characteristics, plus some other notable systems from around the country.
64-node dual-Pentium cluster              | Fast Ethernet    | 8.5 MB/sec     | ~100 µs
24-node Alpha 21164 cluster               | Gigabit Ethernet | 30 MB/sec      | ~100 µs
11-node 2.4 GHz dual-Xeon cluster         | Gigabit Ethernet | 110 MB/sec     | 62 µs
IBM Cluster (180 proc Power3)             | Myrinet and GigE | 200-100 MB/sec | ?-~100 µs
SCI Cluster (64-node dual-Athlon MP2200)  | SCI in a 2D mesh | ? MB/sec       | ? µs
C-Plant Phase II/III (1536 node 466 MHz Alpha 21264 cluster) | ~100 MB/sec    | ??? µs
One look at the communication rates and message latencies shows
that they are much worse than for traditional MPP systems.
Obviously you get what you pay for, but the networks for cluster
computers are quickly closing the gap.
It is therefore more difficult to get many programs to run well on cluster
computers. Many applications will
not scale to as many nodes due to the slower networking, and some codes
simply will not run well at all and must be limited to MPP systems.
There is also a wide range of capabilities illustrated by the mixture
of clusters above. This range starts with the ultra-cheap PC clusters
connected by Fast Ethernet. These systems can be built for tens of
thousands of dollars, but the slow speed of the Fast Ethernet interconnects
greatly limits the number of codes that can utilize this type of a system.
The workstation clusters cost more, but can handle faster networking.
Gigabit Ethernet is maturing to where it can deliver up to 100 MB/sec,
albeit at fairly high latencies at this point. Custom solutions such
as Myrinet can deliver slightly faster rates at lower latencies, but
also cost more. These are therefore only appropriate for clusters made
from more costly workstations.
Network and internet computing
Cluster computers are made from many computers, usually
identical, that are located in the same room. Heterogeneous
clusters use different computers, and are much more difficult
to program because the computing rates and even the communication
rates may vary.
Network computing, or internet computing, is using a heterogeneous
mixture of geographically separated workstations to perform calculations.
The idea of using the spare cycles on desktop PCs or workstations has
been around for years. Unfortunately, it only works for a very limited
number of applications due to the very low communication rate of the
network, the large latencies, the differing CPU rates, and the need to
allow for fault tolerance.
The SETI@home project
is probably the most famous
application that can make use of any spare cycles on the internet.
There are also commercial firms that help companies to do the same.
Metacomputing is a similar idea, but with loftier goals. Supercomputers
that may be geographically separated can be combined to run the same
program. However, the goal in metacomputing is usually to provide
very high bandwidths between the supercomputers so that these connections
do not produce a bottleneck for the communications. Scheduling exclusive
time on many supercomputers at the same time can also pose a problem.
This is still an area of active research.
Distributed memory systems with SMP nodes
The memory subsystems of PCs and workstations are rapidly improving
to where they can support more processors. Large cache sizes and
more memory bandwidth are instrumental to this success. With these
improvements, it is becoming more common to find distributed memory
systems built with SMP nodes.
This has been done in the past. The Intel Paragon is one example,
where the MP nodes had two i860 compute processors in addition to one
used as a communication coprocessor. Unfortunately, the cache size and
memory bandwidth were so small, and the time to switch which processor
had access to main memory was so great, that the second processor was
pretty much useless except for heating the computer room.
Our IBM cluster is one example, where each node consists of a
dual-processor or quad-processor IBM workstation.
Many smaller clusters use dual-Xeon or dual-Athlon nodes.
IBM MPP systems are currently being built with 16-way SMP nodes.
Compaq's Wildfire systems are essentially built from 4-processor ES40s.
How best to program in this mixed environment of distributed SMP systems
is still a very open question. The most efficient method would probably be
to do message-passing between SMP nodes and multithreading within,
but this requires two levels of parallelization by two distinctly
different methods. This takes a lot of programming effort, and
presents the same difficulties as multithreading itself. The more
common approach is to use message-passing for everything. This
is easiest, and the most portable approach, but not the optimal choice.
Research is still needed in this area.
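As a rough sketch of the mixed approach described above (assuming one MPI
process per SMP node, an OpenMP-capable compiler, and a placeholder for the
real work), message passing handles the communication between nodes while
an OpenMP directive spreads each node's block of work across its own
processors:

    #include <stdio.h>
    #include <mpi.h>

    #define N 1000000

    int main(int argc, char *argv[])
    {
        int rank, nprocs;
        long lo, hi, i;
        double part = 0.0, total = 0.0;

        MPI_Init(&argc, &argv);                 /* one process per SMP node */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        lo = rank * (N / nprocs);               /* this node's block        */
        hi = (rank == nprocs - 1) ? N : lo + N / nprocs;

        /* Within the node, multithreading splits the block across processors. */
        #pragma omp parallel for reduction(+:part)
        for (i = lo; i < hi; i++)
            part += 1.0;                        /* stand-in for real work   */

        /* Between nodes, message passing combines the partial results. */
        MPI_Reduce(&part, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %f\n", total);

        MPI_Finalize();
        return 0;
    }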