4.4.5 Distributed Shared Memory
Two research groups investigated Distributed Shared Memory (DSM), an approach to communication between applications. In contrast to the message-passing model used by applications in most of the testbed work, DSM attempts to emulate a shared-memory communication model across a network. The major motivation for doing so is to hide the complexities of networked communication from the application programmer, making it appear as if two or more processes were executing on a single shared-memory computer.
The major obstacle to making DSM work efficiently is end-to-end network latency, which, because of propagation delay, is typically much greater than the latency encountered within a computer, even across a local area network. While DSM schemes have been implemented successfully on LANs, the challenge for testbed researchers was to make DSM work across a gigabit wide area network.
As in internal computer architectures, caching provides a basic mechanism for hiding latency, but the problem of cache consistency is greatly magnified by the much larger latencies involved. The testbed work therefore focused on techniques that either relax cache synchronization constraints or reduce the latency an application sees when it needs remotely generated data, in either case reducing or eliminating the time an application must suspend execution because of a network event.
As part of the work at UIUC, researchers developed a coordinated memory model in which cache synchronization is relaxed by optimistic data sharing. That is, a program continues to execute after changing shared data, optimistically assuming that no conflicting changes by others are in transit and invoking recovery mechanisms only for the hopefully small number of cases in which conflicts do occur. Experiments were performed over the Blanca testbed using three representative applications to evaluate the technique's effectiveness: a matrix multiplication, a solution technique for partial differential equations, and a quicksort algorithm. The first application represented a reference case in which network latency had no effect on the result, verifying that the coordination scheme did not in fact degrade performance. For the last two applications, a performance improvement of 50 to 60% was achieved when coordination was used with two and three nodes. This represented a near-linear speedup with the number of nodes, with computation and communication effectively fully overlapped.
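The optimistic scheme can be illustrated with a small sketch. This is a hypothetical, single-machine simulation of the idea, not the UIUC implementation: each node writes its local copy immediately without waiting for remote acknowledgments, and conflicts with in-transit remote writes are detected only at synchronization points, where the node rolls back to a checkpoint and adopts the conflicting state. The `OptimisticPage` class and its version-stamp protocol are illustrative assumptions.

```python
# Sketch of optimistic data sharing with rollback-based recovery
# (hypothetical API; illustrates the idea, not the UIUC system).
import copy

class OptimisticPage:
    def __init__(self, data):
        self.data = data                        # local working copy
        self.version = 0                        # last synchronized version
        self.checkpoint = copy.deepcopy(data)   # state to roll back to

    def write(self, key, value):
        # Optimistic: apply the write locally without blocking on the network.
        self.data[key] = value

    def synchronize(self, remote_version, remote_data):
        """Reconcile with a peer at a synchronization point.

        Returns True if the optimistic assumption held (no conflicting
        remote writes), False if a rollback was required.
        """
        if remote_version == self.version:
            # No conflicting writes were in transit: commit local changes.
            self.version += 1
            self.checkpoint = copy.deepcopy(self.data)
            return True
        # Conflict: a remote write raced ours. Recover by discarding the
        # optimistic work and adopting the remote state for re-execution.
        self.data = copy.deepcopy(remote_data)
        self.version = remote_version
        self.checkpoint = copy.deepcopy(self.data)
        return False

page = OptimisticPage({"x": 0})
page.write("x", 41)                        # proceeds without suspending
assert page.synchronize(0, {"x": 41})      # versions match: commit
assert not page.synchronize(5, {"x": 7})   # conflict: roll back and re-run
assert page.data["x"] == 7
```

The payoff is exactly the overlap reported above: because `write` never blocks, computation proceeds concurrently with communication, and only the (hopefully rare) conflicting cases pay the recovery cost.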
In the Aurora testbed, Penn researchers investigated several different latency-hiding techniques. The first was similar to the UIUC approach in that it relaxed cache consistency constraints, using a policy of "read gets recent write" instead of the "read gets last write" policy used with strict consistency. In addition, this approach used two different page sizes, one for control and a second for data, and a mechanism that combined page access with process state control.
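The distinction between the two read policies can be sketched as follows. Under a hypothetical staleness bound (an illustrative parameter, not Penn's actual interface), a relaxed read returns the most recent write the cache has already heard about, whereas strict "read gets last write" would stall until every earlier write had been delivered over the wide-area link.

```python
# Sketch contrasting "read gets recent write" (relaxed) with
# "read gets last write" (strict); names and API are illustrative.

class RelaxedCache:
    def __init__(self, staleness_bound):
        self.staleness_bound = staleness_bound  # seconds of tolerated lag
        self.value = None
        self.write_time = 0.0

    def apply_remote_write(self, value, write_time):
        # Keep the newest write this node has heard about so far.
        if write_time > self.write_time:
            self.value, self.write_time = value, write_time

    def read(self, now):
        """Relaxed read: return a *recent* write without blocking.

        A strict "read gets last write" policy would instead suspend the
        reader until all writes timestamped <= now had arrived, paying
        the full wide-area round-trip latency.
        """
        if now - self.write_time <= self.staleness_bound:
            return self.value   # recent enough: serve from the local copy
        return None             # too stale: caller must fetch remotely

cache = RelaxedCache(staleness_bound=0.5)
cache.apply_remote_write("v1", write_time=10.0)
assert cache.read(now=10.3) == "v1"   # within bound: no network wait
assert cache.read(now=11.0) is None   # stale: would require a fetch
```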
Three other Penn efforts investigated ways of directly reducing the latency involved in getting data to the application. The first of these used anticipation, in which data was sent to remote nodes before it was requested, trading additional network bandwidth and storage for reduced latency. The second investigated the use of intelligent caching within the wide area network, with a minimum spanning tree used to reduce propagation delay through strategic selection of cache locations. The third approach exploited knowledge of the application through the use of program-defined objects to achieve optimized caching lookup and reference strategies, building on the ideas of the first two schemes.
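The spanning-tree idea in the second effort can be sketched with standard machinery: connect the candidate cache sites by the tree of minimum total link delay, so that updates propagate among caches over the shortest set of links. The sketch below uses Prim's algorithm on a hypothetical four-site delay matrix; the site names and delays are invented for illustration.

```python
# Minimum spanning tree over a hypothetical WAN delay graph (Prim's
# algorithm), as a sketch of delay-aware cache-site interconnection.
import heapq

def mst_edges(delay):
    """Return the minimum-spanning-tree edges of a link-delay graph.

    delay: dict mapping node -> {neighbor: link_delay_ms} (symmetric).
    """
    start = next(iter(delay))
    visited = {start}
    frontier = [(d, start, v) for v, d in delay[start].items()]
    heapq.heapify(frontier)
    edges = []
    while frontier and len(visited) < len(delay):
        d, u, v = heapq.heappop(frontier)
        if v in visited:
            continue
        visited.add(v)
        edges.append((u, v, d))
        for w, dw in delay[v].items():
            if w not in visited:
                heapq.heappush(frontier, (dw, v, w))
    return edges

# Hypothetical cache sites with symmetric link delays in milliseconds.
delay = {
    "A": {"B": 5, "C": 9, "D": 20},
    "B": {"A": 5, "C": 3, "D": 12},
    "C": {"A": 9, "B": 3, "D": 7},
    "D": {"A": 20, "B": 12, "C": 7},
}
tree = mst_edges(delay)
# Tree uses A-B (5), B-C (3), and C-D (7), avoiding the 20 ms A-D link.
assert sum(d for _, _, d in tree) == 15
```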