Types of Parallel Computers

There are many types of computers available today, from single-processor or 'scalar' computers to machines with vector processors to massively parallel computers with thousands of microprocessors. Each platform has its own unique characteristics, and understanding the differences is important to understanding how best to program each. However, the real trick is to write programs that will run reasonably well on a wide range of computers.

Scalar computers

Your typical PC or workstation is a scalar, or single-processor, computer. These machines use many different processors and run a wide variety of operating systems. Below is a list of some current (Summer 2003) processors and the operating systems each can run; Linux is now available for almost all of them.

Processor                      Clock speed    Operating systems
Intel Pentium / AMD Athlon     --> 3.0 GHz    Linux, MS Windows, FreeBSD, NetBSD
AMD Opteron / Intel Itanium2   --> 1.8 GHz    Linux, MS Windows, FreeBSD, NetBSD
IBM Power4                     --> 1.7 GHz    AIX, Linux
Sun UltraSparcIII              --> 1.2 GHz    Solaris, Linux
SGI R16000                     --> 700 MHz    IRIX
Motorola G5                    --> 2.0 GHz    Mac OS X, Linux
Alpha 21264                    --> 1.26 GHz   Tru64 Unix, Linux


Parallel vector processors

The great supercomputers of the past were built around custom vector processors. These are the expensive, high-performance masterpieces pioneered by Seymour Cray. There are currently only a few examples of computers in production that still use vector processors, and all are parallel vector processors (PVPs) that run small numbers of vector processors within the same machine.

NEC Earth Simulator    8 GFlops / 500 MHz proc   --> 41 TFlops peak
Cray SV1               8 GFlops / vector proc    --> 64 GFlops/cabinet
NEC SX-6 (Cray SX-6)   8 GFlops / proc
Cray SV2                                         --> 10's of TFlops

Vector processors operate on large vectors of data at the same time. The compiler automatically vectorizes the innermost loops to break the work into blocks, often of 64 elements in size, when possible. The functional units are pipelined so that, once full, they complete one element per clock cycle across all 64 elements, and the memory subsystem is optimized to keep the processors fed at this rate. Since the compilers do much of the optimization automatically, the user only needs to attack those problem areas where some impediment prevents the compiler from understanding how to vectorize the loop.

MPI, compiler directives, OpenMP, and pThreads packages can be used to parallelize a program to run across multiple vector processors.


Shared-memory multiprocessors

SMP systems have more than one scalar processor sharing the same memory and memory bus. This category includes everything from a dual-processor Intel PC to a 256-processor Origin3000. Each processor may have its own cache on a dedicated bus, but all processors are connected to a common memory bus and memory bank.

In a well-designed system, the memory bus must be fast enough to keep data flowing to all the processors. Large data caches are also necessary, as they allow each processor to pull data into 'local' memory to crunch on while other processors use the memory bus. Most current SMP systems meet these two design criteria. Early Pentium SMP systems, and the multiprocessor nodes of the Intel Paragon, did not, and therefore the additional processors were relatively useless. Below are some other SMP systems, and the number of processors each system can have.

System            Processors                  Operating systems
SGI Origin3000    --> 256 MIPS R12000 procs   IRIX
SGI Origin2000    --> 128 MIPS R10000 procs   IRIX
SGI Origin200     2-4 MIPS R10000 procs       IRIX
Compaq ES40       2-4 Alpha 21264 procs       Tru64 or Linux
Compaq DS20       2 Alpha 21264 procs         Tru64 or Linux
IBM 43p           1-2 Power3 procs            AIX or Linux
IBM 44p           2-4 Power3 procs            AIX or Linux
IBM ??            --> 32 Power4 procs         AIX or Linux
Intel Pentium     2-4 Xeon procs              MS Windows, Linux, others
AMD Athlon        1-2 Athlon procs            MS Windows, Linux, others
AMD Opteron       1-2 Opteron procs           MS Windows, Linux, others
Intel Itanium2    1-2 Itanium2 procs          MS Windows, Linux, others

SMP systems can be programmed using several different methods. A multithreading approach can be used where a single program is run on the system. The program divides the work across the processors by spawning multiple light-weight threads, each executing on a different processor and performing part of the calculation. Since all threads share the same program space, there is no need for any explicit communication calls.

Compilers are sophisticated enough to use multithreading to automatically parallelize some codes. Unfortunately, this works for relatively few codes. While multithreading may be the most efficient way to program SMP systems for some applications, it is not easy to do manually. There are also many choices when it comes to picking a multithreading package. OpenMP may be the most popular standard at the moment, but there are many vendor-specific packages and other standards such as the POSIX pThreads package.

In summary, multithreading may produce the most efficient code for SMP systems, but it may not be the easiest way to parallelize a code. It also may not be portable, even across SMP systems, unless a standard like OpenMP is chosen that is supported almost everywhere. Of even more concern, a multithreaded code will not run on distributed memory systems.

In the message-passing paradigm, each processor is treated as a separate computer running a copy of the same program. Each processor operates on a different part of the problem, and data exchange is handled by passing messages. More details about this approach will be presented in the next section.

While not as efficient as using multithreading, the resulting code is much more portable since message-passing is the predominant method for programming on distributed memory systems. Message-passing implementations on SMP systems are very efficient since the data doesn't have to traverse the network as in distributed memory systems. It is just copied from memory to memory, which should occur at high speed and with little overhead.

Distributed memory MPPs

The distributed memory computers, the largest MPP (massively parallel processor) systems, come in an even wider variety. However, these systems all share a few traits in common. Each is made of many individual nodes, each of which is essentially an independent computer in itself; in the case of workstation/PC clusters, each node literally is a complete computer.

Each node consists of at least one processor, its own memory, and a link to the network that connects all nodes together. Aside from these generalizations, distributed memory systems may look very different.

Traditional MPP systems may contain hundreds or thousands of individual nodes. They have custom networks that are typically very fast with relatively low latencies. The network topologies vary widely, from completely connected systems to 2D and 3D meshes and fat trees. Below is a partial listing of some of the more common MPP systems available today, and some of the characteristics of each.


System          Max nodes    Processor     Network         Bandwidth               Latency
Cray X1         ? nodes      ?             ?               ? MB/sec                ? µs
Cray T3E        1024 nodes   Alpha 21164   3D torus        340 MB/sec              2-3 µs
IBM SP          6656 nodes   Power3/4      Colony switch   340 MB/sec (off-node)   20 µs (off-node)
                                                           540 MB/sec (on-node)    10 µs (on-node)
Intel Paragon   1836 nodes   i860          2D mesh         130 MB/sec              100 µs

There have been many other large MPP systems over the last decade. While some of the machines may remain in service, the companies behind them have gone out of business. These include machines such as the Thinking Machines CM-5, the Kendall Square systems, the nCube 2, and the Intel Paragon listed above. These systems soared in popularity, then crashed just as fast. The reasons for this vary, from the difficulty of surviving the high cost of using custom components (especially processors) to having good technology but a poor business model.

MPP systems are programmed using message-passing libraries. Most have their own custom libraries, but all current systems also support the industry-standard MPI message-passing interface. This allows codes programmed in MPI to be portable across distributed memory systems, and also SMPs as described previously. Unfortunately, the MPI implementation is not always as efficient as the native communication library, so there is still some temptation to use the native library at the expense of program portability. The Cray T3E is an extreme case, where the MPI implementation only achieves 160 MB/sec while the native SHMEM library can deliver twice that rate.

Cluster computers

Distributed memory computers can also be built from scratch using mass-produced PCs and workstations. These cluster computers go by many names, from the 'poor man's supercomputer' to COWs (clusters of workstations) and NOWs (networks of workstations).

They are much cheaper than traditional MPP systems, and often use the same processors, but are more difficult to use since the network capabilities are currently much lower. Cluster computers are also usually much smaller, most often involving fewer than 100 computers. This is in part because the networking and software infrastructure for cluster computing is less mature, making it difficult to make use of very large systems at this time. Below is a list of some local clusters and their characteristics, plus some other notable systems from around the country.


Cluster                64 Configuration                        Network            Bandwidth        Latency     OS
ALICE                  64-node dual-Pentium                    Fast Ethernet      8.5 MB/sec       ~100 µs     Linux
Gecoa                  24-node Alpha 21164                     Gigabit Ethernet   30 MB/sec        ~100 µs     Linux
Medusa                 11-node 2.4 GHz dual-Xeon               Gigabit Ethernet   110 MB/sec       62 µs       Linux
IBM Cluster            180-proc Power3                         Myrinet and GigE   200-100 MB/sec   ?-~100 µs   AIX
SCI Cluster            64-node dual-Athlon MP2200              SCI in a 2D mesh   ? MB/sec         ? µs        Linux
C-Plant (Phase II/III) 1536-node 466 MHz Alpha 21264           Myrinet            ~100 MB/sec      ??? µs      Linux

One look at the communication rates and message latencies shows that they are much worse than for traditional MPP systems. Obviously you get what you pay for, but the networks for cluster computers are quickly closing the gap.

It is therefore more difficult to get many programs to run well on cluster computers. Many applications will not scale to as many nodes due to the slower networking, and some codes simply will not run well at all and must be limited to MPP systems.

There is also a wide range of capabilities illustrated by the mixture of clusters above. This range starts with the ultra-cheap PC clusters connected by Fast Ethernet. These systems can be built for tens of thousands of dollars, but the slow speed of the Fast Ethernet interconnects greatly limits the number of codes that can utilize this type of system.

The workstation clusters cost more, but can handle faster networking. Gigabit Ethernet is maturing to where it can deliver up to 100 MB/sec, albeit at fairly high latencies at this point. Custom solutions such as Myrinet can deliver slightly faster rates at lower latencies, but also cost more. These are therefore only appropriate for clusters made from more costly workstations.

Network and internet computing

Cluster computers are made from many computers, usually identical, that are located in the same room. Heterogeneous clusters use different computers, and are much more difficult to program because the computing rates and even the communication rates may vary.

Network computing, or internet computing, is using a heterogeneous mixture of geographically separated workstations to perform calculations. The idea of using the spare cycles on desktop PCs or workstations has been around for years. Unfortunately, it only works for a very limited number of applications due to the very low communication rate of the network, the large latencies, the differing CPU rates, and the need to allow for fault tolerance.

The SETI@home project is probably the most famous application that can make use of any spare cycles on the internet. There are also commercial firms that help companies do the same.

Metacomputing

Metacomputing is a similar idea, but with loftier goals. Supercomputers that may be geographically separated can be combined to run the same program. However, the goal in metacomputing is usually to provide very high bandwidths between the supercomputers so that these connections do not produce a bottleneck for the communications. Scheduling exclusive time on many supercomputers at the same time can also pose a problem. This is still an area of active research.

Distributed memory systems with SMP nodes

The memory subsystems of PCs and workstations are rapidly improving to where they can support more processors. Large cache sizes and more memory bandwidth are instrumental to this success. With these improvements, it is becoming more common to find distributed memory systems built with SMP nodes.

This has been done in the past. The Intel Paragon is one example, where the MP nodes had two i860 compute processors plus a third i860 serving as a communication coprocessor. Unfortunately, the cache size and memory bandwidth were so small, and the time to switch which processor had access to main memory was so great, that the second processor was pretty much useless except for heating the computer room.

Our IBM cluster is one example, where each node consists of a dual-processor or quad-processor IBM workstation. Many smaller clusters use dual-Xeon or dual-Athlon nodes. IBM MPP systems are currently being built with 16-way SMP nodes. Compaq's Wildfire systems are built essentially from 4-processor ES40s.

How best to program in this mixed environment of distributed SMP systems is still a very open question. The most efficient method would probably be to do message-passing between SMP nodes and multithreading within, but this requires two levels of parallelization by two distinctly different methods. This takes a lot of programming effort, and presents the same difficulties as multithreading itself. The more common approach is to use message-passing for everything. This is easiest, and the most portable approach, but not the optimal choice. Research is still needed in this area.

