[MUD-Dev] [TECH] Distributed MUD

J C Lawrence claw at 2wire.com
Wed Apr 25 17:39:49 New Zealand Standard Time 2001

On Wed, 25 Apr 2001 00:50:58 -0600 
Kwon Ekstrom <justice at softhome.net> wrote:

> There hasn't been much posted here about distributing by role.

Sort of.  That's one corner of what I've been referring to as
predictive segmentation and distribution.

> What little there has been was generally on a side note.


> IMHO, to distribute by role, you would need at least 3 basic types
> of server.  A front end user handler, backend database, and
> backend handler.

Yes and no.  At the functional level the basic block diagram
certainly seems to look something like:

    [ConnectionServer]<===>[Command/EventProcessor]<===>[DB]
                                 ^|                      ||
                                 ||                      ||
                                 |v                      |v
                         [AutomationSystem]    [ReplicationServers]

One of the interesting bits of this is that the distribution model
doesn't need to follow the block diagram.  While there's a certain
architectural simplicity in making the backing store a discrete
system, it encourages a significant IPC overhead and various sorts
of cache contention problems.  One of the more interesting
approaches to me is distributing the Command/EventProcessor boxes
across N systems, and having the backing store be a distributed
autonomous database across the same set of systems, where the
cluster as a whole manages cache locality against a shared-write
physical store (e.g. an ObjectStore a la Panasas).
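As a rough illustration of that co-located layout (a sketch under
assumptions, not the actual design; all names here are hypothetical),
each node can own a hash-determined shard of the object space, serve
reads for its shard from local cache, and write through to the shared
physical store:

```python
# Hypothetical sketch: N CEP nodes jointly host the object database.
# Each ObjectID hashes to one owning node, which caches that object
# locally and writes it through to the shared physical store.

class Node:
    def __init__(self, node_id, shared_store):
        self.node_id = node_id
        self.cache = {}                  # local slice of the distributed DB
        self.shared_store = shared_store

    def read(self, object_id, cluster):
        owner = cluster[hash(object_id) % len(cluster)]
        if object_id not in owner.cache:             # cache miss: fault in
            owner.cache[object_id] = owner.shared_store.get(object_id)
        return owner.cache[object_id]

    def write(self, object_id, value, cluster):
        owner = cluster[hash(object_id) % len(cluster)]
        owner.cache[object_id] = value               # keep the cache hot
        owner.shared_store[object_id] = value        # write-through

shared = {"sword#17": {"damage": 4}}
cluster = [Node(i, shared) for i in range(4)]
print(cluster[0].read("sword#17", cluster))   # routed to the owning node
```

Any node can initiate a read or write, but the traffic always lands on
the shard's owner, which is what keeps cache locality manageable.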

What I've been thinking about this morning is a method for
dynamically distributing load across an arbitrary cluster of
processor boxes.  Doing it "properly" is expensive and complex.  My
interest is in whether a moderately cheap approximation can do well
enough for the interesting cases.

Consider the above diagram with the following change:

    [ConnectionServer]                                   [DB]
                                 ^|                       ||
                                 ||                       ||
                                 |v                       ||
    [AutomationSystem]<===>[Distributor]                  ||
                                 ^|                       ||
                                 ||                       ||
                                 |v                       |v
              [Command/EventProcessor cluster]  [ReplicationServers]

What happens is:

  Commands come in off the Connection server.  Automation "commands"
  come in off the Automation system.  Both go through a clearing
  house system I'll call the "Distributor".  Each item that is sent
  to the Distributor has a concept of an "Owner" (likely the
  ObjectID that "owns" the command).

  The Distributor farms the resulting "processes" out to machines in
  the Command/EventProcessor cluster in whatever manner it sees fit.

  Upon completion of a task the CEP nodes report back to the
  Distributor the list of modified objects and the list of
  referenced objects in the given transaction.  In this manner the
  Distributor can maintain an ongoing definition of the current
  working set for each "owner" it is processing items for.
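A minimal sketch of that bookkeeping (names hypothetical, a sketch
rather than a design): on each completion report the Distributor
records, per owner, which objects the transaction modified and which
it merely referenced, weighting modifications more heavily:

```python
from collections import defaultdict

class Distributor:
    """Tracks the current working set per 'owner' (illustrative sketch)."""

    def __init__(self):
        # owner -> {object_id: weight}; modified objects outweigh referenced
        self.working_sets = defaultdict(dict)

    def report(self, owner, modified, referenced):
        """Called by a CEP node when it completes a task for an owner."""
        ws = self.working_sets[owner]
        for oid in modified:
            ws[oid] = max(ws.get(oid, 0), 2)    # modified: high priority
        for oid in referenced:
            ws[oid] = max(ws.get(oid, 0), 1)    # referenced: low priority

d = Distributor()
d.report("player#42", modified=["sword#17"], referenced=["room#3"])
d.report("player#42", modified=["room#3"], referenced=[])
print(d.working_sets["player#42"])   # {'sword#17': 2, 'room#3': 2}
```

Using max() means an object once modified never decays back to
"merely referenced"; a real system would presumably age entries out.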

  The Distributor can now build a graph of the owners against the
  working sets, with modified working-set objects given priority
  over merely referenced objects.  Relatively simple graph traversal
  (think about it) will allow "blobs" in the graph to be located,
  and a crude approximate segmentation of the great pool of working
  sets into N pools, where N is the number of nodes in the CEP
  cluster, along lines of minimum intersection.

    Getting perfect graph segmentation appears hard at first blush.
    The same first blush suggests that a simple walk across the
    aggregate space spotting valleys (areas of low intersection)
    should be comparatively cheap, and simply raising the
    threshold when the space between two valleys is too large (a
    very large, highly intersecting collection of working sets)
    will likely give a Good Enough approximation of reasonable
    division lines.
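The valley walk might be approximated like this (a sketch under
stated assumptions: owners are laid out along one axis, "intersection"
between adjacent owners is the weighted overlap of their working
sets, and cut points are placed at the lowest-overlap valleys; all
names are hypothetical):

```python
def overlap(ws_a, ws_b):
    """Weighted intersection of two working sets ({object_id: weight})."""
    return sum(min(ws_a[oid], ws_b[oid]) for oid in ws_a.keys() & ws_b.keys())

def segment(owners, working_sets, n_pools):
    """Split an ordered list of owners into n_pools at minimum-overlap
    valleys -- a crude one-dimensional stand-in for graph partitioning."""
    if n_pools <= 1 or len(owners) <= 1:
        return [owners]
    # Overlap between each adjacent pair of owners along the walk.
    gaps = [overlap(working_sets[a], working_sets[b])
            for a, b in zip(owners, owners[1:])]
    # Cut at the n_pools - 1 smallest gaps (the "valleys").
    cuts = sorted(sorted(range(len(gaps)), key=lambda i: gaps[i])[:n_pools - 1])
    pools, start = [], 0
    for c in cuts:
        pools.append(owners[start:c + 1])
        start = c + 1
    pools.append(owners[start:])
    return pools

ws = {
    "a": {"room#1": 2, "room#2": 1},
    "b": {"room#1": 2},              # shares room#1 with a
    "c": {"dungeon#9": 2},           # disjoint: a natural valley before c
    "d": {"dungeon#9": 1},
}
print(segment(["a", "b", "c", "d"], ws, 2))   # [['a', 'b'], ['c', 'd']]
```

Fixing the owner ordering up front dodges the hard part of real graph
segmentation, which is the point: cheap and Probably Good Enough.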


The idea is to not get involved in the complexity of doing a full
CC:NUMA architecture, especially in regards to proper cache
management and cache line exchange, and instead to do a Merely Good
Enough stab at a second-order approximation of cache management to
dynamically build a Probably Good Enough dynamic cluster allocation.

> I'm not sure how you'd setup the database, but it should be
> written specifically to handle the objects being used by the
> server.  The problem comes with synchronizing data, and managing
> bandwidth.

Bandwidth is the minor problem -- 10Gig-E is already available.
Cache coherency and lock contention/synchronisation are what scare
me.

J C Lawrence                                       claw at kanga.nu
---------(*)                          http://www.kanga.nu/~claw/
--=| A man is as sane as he is dangerous to his environment |=--
MUD-Dev mailing list
MUD-Dev at kanga.nu
