[MUD-Dev] Naming and Directories?

Hans-Henrik Staerfeldt hhs at cbs.dtu.dk
Wed Mar 17 11:28:42 New Zealand Daylight Time 1999


On Wed, 17 Mar 1999, Jon A. Lambert wrote:

> On 16 Mar 99, Mik Clarke wrote:
> > Ummm. Yick. 26 4 byte pointers by, say 16 characters long just to ind=
ex 100-200 names?
> > Even with a sensible implementation you're going to end up wasting 90=
% of the storage you
> > allocate (and a ten letter name means you have to allocate 1K pointer=
s, each different
> > name then requires another 900 bytes or so).
>=20
> But you have changed your requirements above from speed to memory.=20
> No fair. ;)
>=20
> The alphabet trie above has some pros and cons:
>=20
> 1) avoids the overhead of string hashing=20
> 2) provides intrinsic support for abbreviations
>=20
> 1) based on English and may not support blanks, puntuation, or other=20
> characters   (of course this can be altered or enforced)
> 2) distribution of names may tend to cluster in certain nodes=20
>=20

Some additional pros and cons on a search trie (for large problems) are:

pro : distribution of names may tend to cluster in certain nodes
 its a pro, because storing names with the same prefix does not make the
 trie grow linearly with name length, so storing "green bottle" and=20
 "green dragon" would require the same space for "green ", only "bottle"
 and "dragon" would be distinct.

pro : for the large _static_ case, you can easily store the trie on disk,
 using file pointers instead of memory pointers. I did this for lookups=20
 in a large genetic database (Genbank, size ~11 Gb) with much success,=20
 the trie (of ~3 million keywords) taking up 82Mb. Note that since
 the namespace is well filled (or clustered), the memory usage is about=20
 30 bytes pr. word, since the trie 'reuses' the common prefixes (words=20
 are approx 6-10 characters long). (the trie is a bit compressed, so=20
 lookup actually takes O(m * sigma), m being the keyword length, with=20
 O(m) disk reads.)

con : it _also_ takes time to insert/delete entries in the trie, so
 you may get another drag on load, as you ease lookup time. You really
 should only use a trie (compared to linear search), if=20
 #lookups*#objects > #movements*sizeof(names) for the locality in=20
 which you use the trie. Most usefull, will probably be to apply it to=20
 global lookups only, since 'movement' here is limited to=20
 creation / deletion.=20


Some optimizations i have heard about (but not trie-d):

memory saver: Instead of a full 26 letter alfabet, you could=20
 cluster individual letters, say 'ab' 'cd' ... such that the number of=20
 pointers is reduced, making a (small)hash or linear search in the end
 (of a greatly reduced number of objects) to see which actually match.

As mentioned earlier (i forget who *blush*): compressing the trie (long=20
 chains that does not branch) is a sure memory saver.
=20
Another compression is possible (it has some girls name): if some (any)
 subtree of the trie is fully 'filled', say f.inst. that the 4 letter=20
 prefixes of your used names covers all combinations, then you can=20
 compress this to a simple list, doing direct lookup, instead of 4=20
 lookups.


Hans Henrik St=E6rfeldt         | =20
email: bombman at diku.dk        |  voice:      +45 40383492=20
  hhs at cbs.dtu.dk              |  voice work: +45 45252425
phone-mail:                   |  address:
  40383492 at sms.tdm.dk         |       Hans Henrik St=E6rfeldt,
WWW-home                      |       Dybendalsvej 74 2. th,
  http://www.cbs.dtu.dk/hhs/  |       2720 Vanl=F8se, Danmark.
                              |
Student of Computer Science   | Scientific programmer at Center for
  and Information Psychology. |   Biological Sequence Analysis,
  at University of Copenhagen |   Technical University of Denmark.




_______________________________________________
MUD-Dev maillist  -  MUD-Dev at kanga.nu
http://www.kanga.nu/lists/listinfo/mud-dev




More information about the MUD-Dev mailing list