[MUD-Dev] bytecode results

Chris Gray cg at ami-cg.GraySage.Edmonton.AB.CA
Thu Jun 18 22:11:38 New Zealand Standard Time 1998


As I promised a month or two ago, here are the results from my experiments
with putting a bytecode interpreter into my server. This was mostly written
back then, with a few words added now. It is nice to back *using* a computer
instead of administering it! This will be part of a document included with
my system. This will be a document included with the system, hence the
format and the introductory material.

Hmm. I'll have to invent a new name for my system, now that I will no longer
run on just Amiga's. Any suggestions? Is 'CGMud' too lame?



AmigaMUD, Copyright 1998 by Chris Gray


	The ByteCode system in AmigaMUD


What is Bytecode?

Bytecode is just a different internal representation for the actions
(functions) in the system. It is slightly more compact, but, most
importantly, it is easier to interpret, so executes faster. I'm not
certain of the name, but it likely comes from the fact that the
"opcodes" in bytecode are always exactly one byte long. The bytecode
interpreter in AmigaMUD executes instructions for a simple virtual
machine, just like a Java virtual machine does. Again, like the Java
virtual machine, the AmigaMUD virtual machine is a stack machine, in
that it doesn't have any registers (other than the program counter,
the stack pointer and a frame pointer), and so pushes and pops things
from the stack a lot. The speed increase comes from the simplicity of
the decoding of the bytecode (each instruction starts with a one-byte
code that completely specifies what it does), and the fact that the
bytecode "compiler" can make some optimizations.


Why Did I Add the Bytecode system?

The non-bytecode interpreter in AmigaMUD is fairly efficient, quite a
bit moreso than the systems in many MUD servers. This is mostly due to
the strongly typed nature of the AmigaMUD programming language, and to
the effort I put into making it fast. However, when I wanted to
investigate fractal world generation, I found that it was far too
slow in AmigaMUD. So, I thought about ways of speeding it up. I
could have made the system generate native 68000 code, since I have
experience in 68000 code generation. That would have given me the
fastest execution. However, I want my system to be portable to other
architectures, and the effort of doing native code generation on all
of the various UNIX platforms, as well as for the ubiquitous Intel x86
architecture is very large. Using a bytecode system can keep my system
portable, but still get good speedup.


Why Didn't I Use Java Bytecode?

There are three main reasons for this. I won't try to put them in any
order of importance, because I don't know the order:

    - it was lots of fun to do my own system

    - Java bytecode is a lot more complicated than mine (I don't have
	anything like Java's constant table), and has many features
	that I don't need.

    - my bytecode can run faster than Java's, mostly because I
	specifically designed it for fast execution. Also, many of the
	features that slow down Java are not present in the AmigaMUD
	programming language, so I could use simpler techniques.


Did I Reach My Goals?

At first, I was quite disappointed in the speedups I was getting with
bytecode. For my first fractal terrain generator, it was difficult to
reach even a 3 times speedup over the old interpreter. I had expected
much more speedup, and could not explain the small gains. Then I
thought about it differently, and thought that perhaps for that code,
the original interpreter was faster than I had thought. I went to a
native code version of the same program, and discovered that it only
ran 6.5 times faster than the bytecode version! That ratio is almost
unheard-of for any kind of interpretation, so I had no reason to be
dissatisfied.

Here is one test program that I used. It is a solver for the "Towers
of Hanoi" problem, but without any printouts. Printouts would limit
the speed of any version to system imposed speeds, so they have been
left out.

    hanoi: proc, owner SysAdmin, useCount 3:
    proc hanoi(int fromPeg, toPeg, usingPeg, n)void:
	if n ~= 0 then
	    hanoi(fromPeg, usingPeg, toPeg, n - 1);
	    hanoi(usingPeg, toPeg, fromPeg, n - 1);
	fi;
    corp

The timings I got for it are:

     n Old  New  Factor  Native  Factor
    -----------------------------------
    18	43    7    6.1	  0.62	  11.2
    19	87   15    5.8	  1.26	  11.9
    20 170   29    5.9	  2.54	  11.4

So, the bytecode interpreter runs about 6 times faster than the old
interpreter, and about 12 times slower than native code, for this
routine. The absolute times are for a 25 Mhz 68040.

The actual fractal terrain generator has somewhat different numbers:

    256 x 256 fractal terrain generation:

    Old  New  Factor  Native  Factor
    --------------------------------
     43   13   3.3	2.0    6.5

Here, the bytecode is 3.3 times faster than the old interpreter, and
6.5 times slower than native code.

Why is there such a large difference in the ratios? I believe this is
an artifact of the nature of the code, and the architecture of the
68000 family of CPUs, and that the ratios would be quite different for
other CPUs. The bytecode machine has 3 address registers (PC, SP, FP,
as mentioned earlier), and needs upto 4 or 5 other 32 bit values in
some instructions. These fit nicely into the registers available in
the 68000 family, and so the bytecode interpreter runs well. The
fractal program has a lot of local variables, more than will fit into
the registers, so it doesn't run very well (at least with the compiler
I used) as native code. So, the slowdown from native to bytecode is
not that much.

On other CPU architectures, things are different. On the register-poor
Intel x86 architecture, the compiler may not be able to keep all of
the virtual machine's registers in real registers, and so the bytecode
will not be as effective. On RISC CPU's, which typically have 32
registers, all of the local variables for the fractal routine will fit
into real registers, and it will run correspondingly faster. In both
of those cases, the slowdown from native code to bytecode will be
larger than I observed for the 68000 family.


Is More Speedup Possible?

There are some things that can be done to squeeze a bit more speedup
from the bytecode interpreter, but they are probably pretty minimal.
One of the things that slows down the bytecode interpreter is the fact
that it must fetch bytes of longer values (16 bit or 32 bit) from the
instruction stream one byte at a time. This is because those values
may not be aligned, and so the interpreter cannot use the 16 or 32 bit
load instructions, which require aligned operands on the 68000 family.
On CPU's which do not require such alignment, those instructions might
be usable, so long as there are no byte-ordering problems (the byte-
order for AmigaMUD bytecode is big-endian). Even with endian
differences, optimization of the bytecode on first execution can be
done, re-arranging the bytes to be in the required order for the
native CPU. On a CPU that does not allow unaligned accesses, some of
the operands will end up properly aligned, just by chance, and the
system could detect those cases and replace the instructions with
equivalent ones that just use the 16 or 32 bit fetch. This is possible
since, once bytecode is present in the server's memory for a function,
it is not moved. A more extreme possiblity is to change from bytecode
to wordcode, where the instructions are 32 bit values, and everything
ends up being properly aligned. This would greatly increase the size
of the code, but might be worthwhile in terms of speedup. With 16 bit
or 32 bit instructions, the interpreter would not need to extend the
bytes before using them to index into the compiler-generated table of
offsets for the case statement which is the core of the engine.

Looking at the 68000 assembler code my compiler generated for the
bytecode interpreter, there are a couple of instructions, per bytecode
instruction executed, that could be removed. With larger opcodes (no
longer just a byte), the opcodes could all be multiples of 2 (the
likely size of the entries in the compiler-generated jump table), so
that the shift generated by the compiler is not needed. Its not clear
that a compiler could ever be convinced of this however, since it
likely doesn't check to see that all possible indexes are multiples of
2, and likely will also insist on checking for all of the odd
values. Since all bytecode is generated by the system, I could assume
that only valid bytecodes are present, and not test the range of the
bytecodes.  Again, this is difficult to convince a compiler of.

--
Chris Gray     cg at ami-cg.GraySage.Edmonton.AB.CA




More information about the MUD-Dev mailing list