Multicore Data Structures

The following was written as background material for a patent application on part of Postulate5’s UnThread library (U.S. Patent Application 12/956,054). It is presented here as generally useful information concerning data structures for multithreaded algorithms, which is expected to be a recurrent topic in Dr. Jones’ Journal.

One of the great challenges in recent software engineering has been the struggle to use, effectively and efficiently, the computational resources of a multiprocessor computer. Traditionally, software tasks have been laid out in a linear, sequential fashion that is well-suited for programming on a single processor. However, most current computers have multiple processors, and keeping all these processors suitably occupied is quite difficult if there is, fundamentally, a single task being worked. The first challenge is to break a single task down into subtasks that can proceed in parallel, but a second is to share global resources among these subtasks in such a way as to optimally use available computational resources.

One of the most common methods for subtasking a long serial task is to have multiple subtasks working on overlapping segments of the task. This often requires that multiple processors have shared access to differing windows of a large resource that may be thought of as a virtual array – an array in which only the entries actually being used are actually constructed (or read from external storage). Many buffering schemes, well known in the art, may be employed to reduce the likelihood of needing to reconstruct array entries multiple times. However, none of these are specifically tuned for this particular pattern of access – one in which groups of array entries are typically required and the release of one group of entries will probably be immediately followed by retrieval of a second (possibly overlapping) group of entries.

The present invention provides a flexible solution to this problem, as well as providing a multiple-strategy mechanism for synchronization of changes to the array entries. The essence of the invention is to separate “master” array entries (constructed from external resources) from “slave” array entries (separate copies given to individual clients). The slave entries may be simply copies of the master entries or may be derived from them in a more complex, user-specified manner. Writes are done directly to the master entries, reads are done from the slave entries. We will refer to an instance of the invention as a Shared Virtual Array (SVA).

The use of memory to store the results of expensive calculations or interactions with storage media dates back to the earliest days of computer programming (see Knuth, The Art of Computer Programming, Volume I, Second Editions, sections 1.4.4-5 for a thorough and readable account of this). Originally, the impetus for the idea seems to have been mainly rate-matching between slow peripherals and much more rapid ALU processing, but it also became apparent early that by storing an expensive result for a while, one might obviate the need to reacquire that result, should it be needed later. Thus, for example, all contemporary operating systems internally buffer the physical sectors read from or written to a storage device in case some other task needs to read or write the same sector again soon. The data is normally retained in a fixed-size pool of buffers on a Most-Recently-Used (MRU) basis. Similarly, prudent software design dictates that the results of calculations are routinely retained until it is clear that they will not be needed again.

In addition, prudent hardware design (see Alvarez et al. (U.S. Patent 3,723,976, 3/27/1973) and Anderson et al. (U.S. Patent 3,735,360, 5/22/1973)) has long made use of high-speed (relative to main memory) cache memories to reduce bus bandwidth requirements, to effectively pipeline processor instructions, and to assist with memory sharing when multiple processors are performing related tasks.

More recently, as computer graphics processing has become more multiprocessor-based, Graphics Processing Unit (GPU) designers have applied similar ideas to their art – see Acocella et al. (U.S. Patent 7,750,915 B1, 7/6/2010) and Mrazek et al. (U.S. Patent 7,755,631 B1, 7/13/2010). As important as these ideas are to CPU and GPU design, their effect on software design has been minimal. Most software design that requires sharing of expensive array entries is typically ad hoc, employing some variant on the use of MRU-queued buffers with a backup method to re-compute or reread any entries that are needed after the source has fallen off the queue. There is no systematic method for efficiently buffering expensive array entries with an access pattern more typical of a multithreaded monolithic task.

More specifically, when multiple threads each access a more or less fixed size group of array entries, the MRU mechanism is no longer sufficient when a thread shifts its attention from one group of entries to another partially overlapping group. All entries in the former group may be essentially tied with one another on a MRU basis, but any array entry that is also in the latter group should be retained, since the thread is about to access it again. Access patterns, not MRU rules, are the real basis on which decisions about discarding expensive data should be made – MRU rules have always been simply used as a convenient proxy for access patterns. The difficulty, of course, is in providing an interface through which access patterns may be easily communicated to the system providing the buffering for the expensive data.

By Kerry N. Jones
Kerry is a co-founder and CTO of Postulate5 and holds a Ph.D. in mathematics from Rice University.