Gigabit Testbeds Final Report

4.4.4 Parallel Architectures

Highly parallel computers with distributed memory architectures introduce a new issue to high speed host I/O, that of dealing with their internal interconnection networks. For supercomputers such as the TMC CM-5 and Intel Paragon which were used in the testbeds, hundreds of processors, each with individual memory, communicate among themselves and with external I/O interfaces via the internal interconnection network. The latter can take many different forms, for example a two-dimensional mesh in the Paragon and a hierarchical "fat tree" in the CM-5.

Internal bandwidths between individual processors within the interconnection networks also differed according to the interconnection architecture, being on the order of 200 MBytes/s in the Paragon and 10 MBytes/s in the CM-5, with aggregate bandwidth results depending on the architecture. While HIPPI interfaces on the machines were individually capable of 800 Mbps, the ability to move data between the interface and a set of processors at that rate was a significant challenge.

To provide an initial assessment of the CM-5's I/O capabilities, NCSA measured the data rates achievable in moving data between the 512-processor machine and its scalable disk array, theoretically capable of reading data at 112 MBytes/s and writing at 100 MBytes/s. The results were a maximum read rate of 95 MBytes/s and write rate of 35 MBytes/s, with the interconnection network found to be the bottleneck for at least some of the test cases used. In the case of the Paragon, the early versions of its operating system software used in the testbeds severely constrained achievable network I/O rates. Because of the inefficiencies inherent in its protocol support, an order of magnitude speedup was achieved using an internal raw HIPPI interface with the LANL CBI outboard TCP/IP device (discussed above), which allowed most of the normal operating system path to be bypassed.

A Cray Research T3D MPP supercomputer was also used in the Casa testbed later in the project. This machine consisted of 256 processors and a three-dimensional torus interconnection network, with a peak interprocessor capability of 300 MBytes/s. Unlike the Paragon and CM-5, however, its external network connections were made to a Cray YMP used as a front-end for the T3D. Thus the HIPPI network connections were interfaced to the shared-memory YMP and did not have to deal directly with the interconnection network.

iWarp Streams Architecture

Parallel processing work at CMU in the Nectar testbed focused on the iWarp, an experimental parallel computer developed by CMU and Intel. It used a torus interconnection network with 40 MByte/s links to each of 64 processing nodes.

The CAB outboard interfacing engine used for the DecStation HIPPI interfacing discussed earlier was also used with the iWarp, but in a different configuration. Called the iCAB, it provided an external HIPPI interface, network buffering and TCP checksum support as in the workstation CAB, but was closely coupled to two of the iWarp's processors which were used as part of the I/O interface. The processors were responsible for TCP/IP protocol processing and the coupling of the sequential network interface to the iWarp's internal distributed processor/memory system (Figure 4-16).

Figure 4-16. iWarp Distributed Memory I/O

The CMU work made the management of data flows between the I/O interface and the distributed memory array the responsibility of each application, which accomplished this through the use of system library code. The latter was part of a streams package, which consisted of the application library routines and a Streams Manager program executed by the I/O processors. The application (or programmer) thus decided on the best data distribution for its processor/memory array based on knowledge of the application's requirements, while the Streams Manager was responsible for efficient movement of the total set of current data streams.

A reshuffling algorithm was executed by the array processors to map the application's data distribution into one that is efficient for transfer to and from the I/O interface. Experiments with reshuffling resulted in internal aggregate transfer rates of 125 MBytes/s for array block sizes as small as 128 bytes. Without reshuffling, a block size of about 6K Bytes was required to achieve the 125 MByte/s rate, with the 128-byte block size yielding only 40 MBytes/s [5].