[MUD-Dev] A variation on persistent store techniques with notes on distributed processing support

J C Lawrence claw at under.engr.sgi.com
Mon Jun 29 13:33:54 New Zealand Standard Time 1998

Quick aside:

For those interested in the heavily threaded distributed processing
end of things, the following may be of interest:


I could easily see them both being applied to server designs, even
possibly on a Beowolf-style cluster.



uDatabase Abstract 

Programmers working with complex and possibly large persistent data
structures are faced with the problem that there are two, mostly
incompatible, views of structured data, namely data in primary and
secondary storage. In primary storage, pointers are used to construct
complex relationships among data; establishing these relationships
without pointers is often cumbersome and expensive.

Significant research has occurred over the last decade on efficient
and simple-to-use methodologies for constructing, storing, and
subsequently retrieving large persistent data structures in a fashion
that makes the secondary storage transparent to the programmer. The
approaches extend primary storage practices and tools so that they
also apply to secondary storage. Merging primary and secondary storage
in this way produces a single-level store, which gives the illusion
that data on secondary storage is accessible in the same way as data
in primary storage. This uniform view of data eliminates the need for
expensive execution-time conversions of structured data between
primary and secondary storage and allows the expressive power and the
data structuring capabilities of a general purpose programming
language to be used for secondary storage. For complex structures, a
single-level store offers substantial performance advantages over
conventional file access, which is crucial to database applications
such as computer-aided design, text management, and geographical
information systems.

While there are several ways to implement a single-level store, many
projects do so using memory-mapping. Memory mapping is the use of
virtual memory to map files stored on secondary storage into primary
storage so that the data is directly accessible by the processor's
instructions. In this environment, there are no explicit read and
write routine calls to access data on disk. All read and write
operations are done implicitly by the operating system during
execution of a program.  When the working set can be kept in memory,
performance begins to approach that of memory-resident databases.

All approaches to implementing memory-mapped single-level stores have
to deal with the address consistency problem. When data is copied from
secondary to primary storage (actually virtual storage), either the
data must be positioned exactly where it was created (to maintain
integrity of references), or the addresses must be modified to reflect
its new location. The former case is difficult to handle because data
from two or more memory-mapped files may map to the same location,
producing an irreconcilable conflict. The latter case is difficult to
handle because it must be possible to locate all pointers so they can
be updated, and there is the additional runtime cost of modifying the
pointers. Pointer modification may be handled eagerly or lazily; in
general, eager modification of pointers is called relocation and lazy
modification is called pointer swizzling.

We argue that a significant performance advantage of a single-level
store is lost if all the pointers within it have to be relocated or
swizzled. This loss of advantage is especially significant for
operations that incur high overhead in data preparation; examples
include operations like sequential scans, where the data is accessed
only once, and operations that deal with large data structures with
small primary storage, where the data is implicitly fetched and
prepared multiple times. Therefore, we are pursuing the alternative
approach of exact positioning of data so relocation or swizzling of
pointers is significantly or completely eliminated during transfers to
and from secondary storage.

To this end, we are developing uDatabase, a toolkit for building
persistent data structures using the ``exact positioning of data''
approach to memory mapping. uDatabase employs a novel technique that
allows application of an old solution to the problem of address
collisions when mapping multiple memory-mapped data structures. The
old solution is hardware segmentation; each hardware segment is an
address space starting at a virtual zero, in which a persistent data
structure can be built, stored, and subsequently retrieved and
modified. Multiple segments can be simultaneously accessed in a single
application because each segment has its own non-conflicting
address-space. When a segment is mapped into memory, pointers within
the segment do not require modification; pointers outside the segment
do require modification, but in general, these pointers represent a
small percentage of the total number of pointers in a data
structure. Our technique uses the UNIX system call mmap to mimic
segmentation on hardware that does not support it.  Furthermore,
uDatabase has direct access to the concurrency facilities provided by
uC++, which allows a high level of concurrency through


     uDatabase : A Toolkit for Constructing Memory Mapped Databases by
       Peter A. Buhr, Anil K. Goel, and Anderson Wai 

     Parallel Pointer-Based Join Algorithms in Memory-Mapped
       Environments by Peter A. Buhr, Anil K. Goel, Naomi Nishimura, and
       Prabhakar Ragde

     Exact Positioning of Data Approach to Memory Mapped Persistent
      Stores: Design, Analysis and Modelling by Anil K. Goel


J C Lawrence                               Internet: claw at null.net
(Contractor)                               Internet: coder at ibm.net
---------(*)                     Internet: claw at under.engr.sgi.com
...Honourary Member of Clan McFud -- Teamer's Avenging Monolith...

More information about the MUD-Dev mailing list