[MUD-Dev] Re: TECH: reliablity (was: Distributed Muds)
bruce at puremagic.com
Sat Apr 28 12:59:45 New Zealand Standard Time 2001
(I've re-arranged some of the quoted sections to reply to things more
coherently. (I hope.))
Derek Snider wrote:
> Are you seriously trying to tell me that memory leaks, memory
> corruption, and crashes _never_ happen when you "follow sound, basic
> software engineering principles"?
No, but they happen much much less often. See below.
> Whatever programming utopia you exist in is far, far, far away from
> the world of commercial game development... or any competitive
> commercial development.
> The larger and more complex a program gets, the more likely something
> is to go wrong, and the harder it is to track it down.
> The only way to have a bug-free program is to write it bug-free in the
> first place.
> Unfortunately this is nearly impossible with high-pressure deadlines.
This seems to be a rather self-defeating viewpoint. :(
For one approach, read what John Buehler was saying here on MUD-Dev
about components some months ago. There's plenty of literature on that
type of strategy as well.
Another approach to avoiding errors is to separate out common classes of
bugs and structure your system so that they are much more difficult to
have happen. With a long running server, it is necessary that the
server both be stable and have a consistent footprint. So just
addressing those points:
  * Memory leaks? Use GC. (Or refcounting, ugh.)
  * Refcount leaks? Check out some of the Mozilla tools for
    detecting and debugging these.
  * Bad pointer references, overwriting memory, other
    memory errors? Don't use C or C++ for everything.
    Use something safer like Python, Scheme, Java,
    whatever for higher level logic. Alternatively, if
    you are going to use C or C++ throughout, make sure that
    all of your common datastructures have solid unit tests
    or come from a known-good-and-stable source (possibly
There are many other concerns as well, such as dealing with complexity
in the internal interfaces. But many of these can be handled at the
architectural level, if anticipated (as they should be). For an example
of dealing with the complexity of interfaces, see the post that I refer
to below about a particular system within the game that I work for, TEC,
But one of the most important things to do is to have a set of unit and
regression tests and to be vigilant in updating them and running them on
all of your core architecture. I don't add core architectural features
to Cold without also adding a set of tests and running those tests under
Purify. (And sometimes performance tests and running them under gprof.)
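As a rough illustration of the kind of test meant here (this is not Cold's
actual test suite, and Python is used only for brevity), a unit test for a
core data structure might look like the sketch below. The RingBuffer class,
its methods, and run_tests() are all made up for the example.

```python
import unittest


class RingBuffer:
    """Hypothetical fixed-capacity queue standing in for a core data structure."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._items = [None] * capacity
        self._head = 0   # index of the oldest stored item
        self._count = 0  # number of stored items

    def push(self, item):
        """Append an item, overwriting the oldest one when full."""
        capacity = len(self._items)
        tail = (self._head + self._count) % capacity
        self._items[tail] = item
        if self._count == capacity:
            self._head = (self._head + 1) % capacity
        else:
            self._count += 1

    def to_list(self):
        """Return the stored items from oldest to newest."""
        capacity = len(self._items)
        return [self._items[(self._head + i) % capacity]
                for i in range(self._count)]


class RingBufferTest(unittest.TestCase):
    def test_keeps_insertion_order(self):
        buf = RingBuffer(3)
        for value in (1, 2, 3):
            buf.push(value)
        self.assertEqual(buf.to_list(), [1, 2, 3])

    def test_overwrites_oldest_when_full(self):
        buf = RingBuffer(3)
        for value in range(5):
            buf.push(value)
        self.assertEqual(buf.to_list(), [2, 3, 4])

    def test_rejects_bad_capacity(self):
        with self.assertRaises(ValueError):
            RingBuffer(0)


def run_tests():
    """Run the suite and report whether every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RingBufferTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

The wrap-around case is the one most likely to regress, which is why it gets
its own test rather than being folded into the ordering test.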
Finally, ensure that you have code that can help you detect critical
errors as they happen and help to isolate the cause. An example of this
is that in Cold, if a memory allocation fails, currently the server will
panic and attempt to shut down cleanly. I've only seen this happen due
to rogue softcode, so we've got a couple of approaches for dealing with
  * Limits on execution time of softcode. You need to yield to
    other tasks and if you fail to do so, you get killed. This
    is also part of the strategy for ensuring that each task gets
    to run regularly as we're a cooperatively tasking system.
  * Allocation logging: You can tell the server to log the current
    softcode stack trace for any allocation/reallocation above a
    given size. This helps detect potential problems within
    softcode where it is dealing with datasets that are larger
    than expected or are constantly growing.
  * Failed allocation logging: The server will log the current
    stack trace for every task when an allocation fails and it
    is shutting down. This can allow you to determine exactly
    what was happening at the time of the failure.
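The execution limit in the first item can be sketched roughly as follows.
This is a toy model in Python, not how Cold implements it: tasks are
generators, each WORK signal stands for one unit of softcode executed, and
PAUSE stands for a voluntary yield. The names run_tasks, tick_limit, polite
and rogue are invented for the example.

```python
from collections import deque

WORK = "work"    # the task executed one unit of softcode
PAUSE = "pause"  # the task voluntarily yielded to the scheduler


def run_tasks(tasks, tick_limit):
    """Run named generator tasks round-robin with a per-slice tick budget.

    Returns (finished, killed): names of tasks that ran to completion and
    names of tasks killed for using more than tick_limit ticks in one slice.
    """
    queue = deque(tasks.items())
    finished, killed = [], []
    while queue:
        name, task = queue.popleft()
        ticks = 0  # the budget starts fresh each time a task is resumed
        while True:
            try:
                signal = next(task)
            except StopIteration:
                finished.append(name)
                break
            if signal == PAUSE:
                queue.append((name, task))  # reschedule behind the others
                break
            ticks += 1
            if ticks > tick_limit:
                task.close()  # kill the task; it never runs again
                killed.append(name)
                break
    return finished, killed


def polite(steps):
    """Does one unit of work, then yields, for the given number of steps."""
    for _ in range(steps):
        yield WORK
        yield PAUSE


def rogue():
    """Never yields to the scheduler."""
    while True:
        yield WORK


if __name__ == "__main__":
    print(run_tasks({"polite": polite(3), "rogue": rogue()}, tick_limit=100))
```

Counting units of work rather than wall-clock time keeps the limit
deterministic, so the same softcode is killed at the same point on every run.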
A lot of this helps with the problems that can be caused by having
less experienced programmers working at the softcode level by raising
awareness of some of the lower-level issues and not letting them go into
dangerous territory without warning.
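To make the two logging approaches concrete, here is a minimal sketch in
Python. It is an assumption-laden stand-in: the LoggingAllocator class, its
parameters and the load_room_descriptions function are hypothetical, the
Python call stack plays the role of the softcode stack trace, and only the
current caller's stack is recorded on failure rather than one per task.

```python
import traceback


class LoggingAllocator:
    """Hypothetical allocator wrapper that logs large and failed requests."""

    def __init__(self, limit_bytes, log_threshold):
        self.limit_bytes = limit_bytes      # total memory this toy heap allows
        self.log_threshold = log_threshold  # log any request above this size
        self.used = 0
        self.log = []                       # (kind, size, stack) records

    def _stack(self):
        # Caller function names, outermost first, with the two allocator
        # frames (_stack and allocate) trimmed from the end.
        return [frame.name for frame in traceback.extract_stack()[:-2]]

    def allocate(self, size):
        """Reserve size bytes, logging the caller stack when warranted."""
        if self.used + size > self.limit_bytes:
            # Failed allocation logging: record what was running, then fail.
            self.log.append(("failed", size, self._stack()))
            raise MemoryError(f"cannot allocate {size} bytes")
        if size > self.log_threshold:
            # Allocation logging: record who asked for an unusually big block.
            self.log.append(("large", size, self._stack()))
        self.used += size
        return bytearray(size)


def load_room_descriptions(heap):
    """Stand-in for softcode that handles a larger dataset than expected."""
    return heap.allocate(64 * 1024)


if __name__ == "__main__":
    heap = LoggingAllocator(limit_bytes=100_000, log_threshold=10_000)
    load_room_descriptions(heap)
    for kind, size, stack in heap.log:
        print(kind, size, " > ".join(stack))
```

The point of recording the stack rather than just the size is that the log
names the code path responsible, which is what lets someone go and fix it.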
Another approach entirely for dealing with allowing staff to extend or
modify the game code would be to stop working with the typical
programming environment and move to something that is rules-based with
an interface that makes it easy for them to modify behavior and
reactions to events. While TEC doesn't (yet?) employ a rules-based
approach, we already support much of the underlying infrastructure for
intercepting events and observing actions which I've previously
described on the list in
These types of things not only -can- serve to reduce the number of
errors, but -do- in projects that I work on today. As such, I'm hard
pressed to see myself as living in some sort of programmer's utopia.
All of this only helps with errors at the level of writing code though.
Addressing things at the specification level (or even having a
specification) is an entirely different topic. Nor does any of this really
help with solving large-scale architectural problems, such as
complexity, which are things that really need to be addressed at the
specification level. But really, there isn't any reason for a lot of
the common sorts of errors to be a problem if you actively take steps to
mitigate your risk.
> I've used Purify and Insure, and many other memory debuggers, and they
> usually choke on large complex programs that make heavy use of memory.
Insure does indeed fall over (and costs way too much). I've used it on
Linux with big software and watched something take over 10 hours that
usually takes about 5 minutes. However, I've run Purify and other memory
debugging utilities on Solaris (and Linux where they were available)
extensively and on large programs (like Mozilla). Even with a 700-800M
process size, things were still manageable and worked acceptably.
But some of the refcount debugging tools developed for Mozilla as well
as things like the Boehm GC in leak detection mode (especially if you
have some of Patrick Beard's patches to enhance the detection of 'leak
roots' for Mozilla) aren't that bad at all, and are, in fact, superior to
Purify within their particular problem domains. The only issue with
them is that you don't get the other sorts of error detection that
Purify provides. But for that, bounded pointer support may make it into
gcc 3.1 (it missed the 3.0 train). There's no reason to not be able to
detect the majority of these types of bugs during routine testing.
That's why Cold is so stable and leak-free. (And the same for TOM as well.)
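The general idea behind refcount logging is simple enough to sketch. What
follows is not Mozilla's tooling or its interface, only a toy Python model
of the technique: record every add-ref and release against a label and
report whatever doesn't balance at the end of a test run. RefcountLog and
its methods are hypothetical names.

```python
from collections import Counter


class RefcountLog:
    """Hypothetical log of add-ref/release calls, keyed by object label."""

    def __init__(self):
        self._balance = Counter()

    def add_ref(self, label):
        """Record that a reference to the labelled object was taken."""
        self._balance[label] += 1

    def release(self, label):
        """Record that a reference to the labelled object was dropped."""
        self._balance[label] -= 1

    def leaks(self):
        """Labels with references still outstanding, and how many."""
        return {label: n for label, n in self._balance.items() if n > 0}

    def over_releases(self):
        """Labels released more often than referenced, and by how much."""
        return {label: -n for label, n in self._balance.items() if n < 0}


if __name__ == "__main__":
    log = RefcountLog()
    log.add_ref("room#12")
    log.add_ref("room#12")
    log.release("room#12")   # one reference is never released
    log.add_ref("mob#7")
    log.release("mob#7")     # balanced
    print(log.leaks())
```

Run as part of routine testing, a non-empty leaks() result at shutdown is
the signal to go looking, and the label says where to start.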
As a quick aside, in Smaug, you're leaking some memory allocated via the
CREATE macro in mob_act_add() fairly regularly. I didn't see any unit
tests in the 1.4a dist, so that was all I noticed in quickly running it