[MUD-Dev] Intelligent WebGlimpse archive searching at Kanga.Nu (was Re: [MUD-Dev] Info about different skill systems)

J C Lawrence claw at under.engr.sgi.com
Wed Jan 6 16:22:29 New Zealand Daylight Time 1999


On Sat, 02 Jan 1999 01:17:56 +0100 
Emil Eifrem <emil at prophecy.lu> wrote:

> I have checked the archives, but didn't find any real explanations
> about it. I may have missed it though, the lack of boolean search
> options in the search engine tend to make exhaustive searches
> tedious at best. 

Aaaargh.  This is an old flunk of mine.  I'd meant to add a help
page detailing how to do intelligent searches, and err, forgot.  You 
can use boolean logic in searches with WebGlimpse.

Quoting from http://glimpse.cs.arizona.edu/glimpsehelp.html#sect11
(slightly reformatted):

--<cut>--

PATTERNS

glimpse supports a large variety of patterns, including simple
strings, strings with classes of characters, sets of strings, wild
cards, and regular expressions. (See LIMITATIONS.)

Strings

Strings are any sequence of characters, including the special
symbols `^' for beginning of line and `$' for end of line. The
following special characters ( `$', `^', `*', `[', `^', `|', `(',
`)', `!', and `\' ) as well as the following meta characters special
to glimpse (and agrep): `;', `,', `#', `<', `>', `-', and `.',
should be preceded by `\' if they are to be matched as regular
characters. For example, \^abc\\ corresponds to the string ^abc\,
whereas ^abc corresponds to the string abc at the beginning of a
line.

Classes of characters

A list of characters inside [] (in order) corresponds to any
character from the list. For example, [a-ho-z] is any character
between a and h or between o and z. The symbol `^' inside []
complements the list. For example, [^i-n] denotes any character in
the character set except the characters `i' to `n'. The symbol `^' thus
has two meanings, but this is consistent with egrep. The symbol `.' 
(don't care) stands for any symbol (except for the newline symbol).

Boolean operations

Glimpse supports an `AND' operation denoted by the symbol `;', an
`OR' operation denoted by the symbol `,', a limited version of a
'NOT' operation (starting at version 4.0B1) denoted by the symbol
`~', or any combination. For example, glimpse `pizza;cheeseburger'
will output all lines containing both patterns. glimpse -F
`gnu;\.c$' `define;DEFAULT' will output all lines containing both
`define' and `DEFAULT' (anywhere in the line, not necessarily in
order) in files whose name contains `gnu' and ends with .c. glimpse
`{political,computer};science' will match `political science' or
`science of computers'. The NOT operation works only together with
the -W option, and it generally applies only to the whole file
rather than to individual records. It currently does not work with
approximate matching. Its output may sometimes seem
counterintuitive. Use with care. glimpse -W 'fame;~glory' will
output all lines containing 'fame' in all files that contain 'fame'
but do not contain 'glory'. This is the most common use of NOT, and
in this case it works as expected. glimpse -W '~{fame;glory}' will
be limited to files that do not contain both words, and will output
all lines containing one of them.

Wild cards

The symbol `#' is used to denote a sequence of any number (including
0) of arbitrary characters (see LIMITATIONS). The symbol # is
equivalent to .* in egrep. In fact, .* will work too, because it is
a valid regular expression (see below), but unless this is part of
an actual regular expression, # will work faster. (Currently glimpse
is experiencing some problems with #.)

Combination of exact and approximate matching 

Any pattern inside angle brackets <> must match the text exactly
even if the match is with errors. For example, <mathemat>ics matches
mathematical with one error (replacing the last s with an a), but
mathe<matics> does not match mathematical no matter how many errors
are allowed. (This option is buggy at the moment.)

Regular expressions

Since the index is word based, a regular expression must match words
that appear in the index for glimpse to find it. Glimpse first
strips all non-alphabetic characters from the regular expression
and searches the index for the remaining words. It then applies the
regular expression matching algorithm to the files found in the
index. For example, glimpse `abc.*xyz' will search the index for all
files that contain both `abc' and `xyz', and then search directly
for `abc.*xyz' in those files. (If you use glimpse -w `abc.*xyz',
then `abcxyz' will not be found, because glimpse will think that abc
and xyz need to be matched as whole words.) The syntax of regular
expressions in glimpse is in general the same as that for agrep. The
union operation `|', Kleene closure `*', and parentheses () are all
supported. Currently `+' is not supported. Regular expressions are
currently limited to approximately 30 characters (generally
excluding meta characters). Some options (-d, -w, -t, -x, -D, -I,
-S) do not currently work with regular expressions. The maximal
number of errors for regular expressions that use `*' or `|' is
4. (See LIMITATIONS.)

--<cut>--
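
As a rough illustration of the escaping rule quoted above (this is my
own sketch, not part of the Glimpse help page): the hypothetical
helper escape_glimpse_literal() below puts a `\' in front of exactly
the characters that the Strings section lists as special.  The helper
name and the idea of pre-processing a search term in Python are
assumptions for illustration only; other characters such as `~' or
`{' are deliberately left alone because that list does not name them.

```python
# Characters the quoted help text says must be preceded by a backslash
# to be matched literally: the general specials plus the glimpse/agrep
# meta characters.  Taken from the Strings section only.
GLIMPSE_SPECIALS = set("$^*[|()!\\") | set(";,#<>-.")


def escape_glimpse_literal(text: str) -> str:
    """Backslash-escape each listed special character (hypothetical helper)."""
    return "".join("\\" + ch if ch in GLIMPSE_SPECIALS else ch for ch in text)


# The help text's own example: the literal string ^abc followed by a backslash.
print(escape_glimpse_literal("^abc\\"))
# A file-name style term: the dot must be escaped to match literally.
print(escape_glimpse_literal("foo.c"))
```

Ordinary words such as "skill tree" pass through unchanged, so the
helper only matters when the thing you are hunting for contains
punctuation.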

WebGlimpse is based atop Glimpse.  All the above instructions for
doing intelligent searches using boolean logic work under WebGlimpse 
at Kanga.Nu.  
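
For anyone who wants to sanity-check a boolean query before typing it
into the search form, here is a minimal sketch.  It models a query as
an AND of OR-groups, renders it in the `;' / `,' / `{}' syntax quoted
above, and emulates the line-level meaning with plain case-sensitive
substring tests.  The function names to_pattern() and line_matches()
are made up for this example, and the emulation is only an
approximation of what Glimpse does (no index, no -w, no approximate
matching).

```python
def to_pattern(and_groups):
    """Render AND-of-OR groups using `;` for AND and `,` for OR."""
    parts = []
    for group in and_groups:
        joined = ",".join(group)
        # Braces are only needed when a group offers alternatives.
        parts.append("{" + joined + "}" if len(group) > 1 else joined)
    return ";".join(parts)


def line_matches(line, and_groups):
    """Approximate the semantics: every group needs at least one hit."""
    return all(any(term in line for term in group) for group in and_groups)


# The example from the quoted help text.
query = [["political", "computer"], ["science"]]
print(to_pattern(query))

for line in ["political science", "science of computers", "computer games"]:
    print(line, "->", line_matches(line, query))
```

The same shape covers the query that started this thread:
to_pattern([["skill"], ["tree"]]) gives skill;tree, which asks for
lines containing both words in any order.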

> (Did you guys *know* how often you say 'skill tree'?)

Am I allowed to use commas in the final figure?

--
J C Lawrence                              Internet: claw at kanga.nu
(Contractor)                             Internet: coder at kanga.nu
---------(*)                    Internet: claw at under.engr.sgi.com
...Honorary Member of Clan McFud -- Teamer's Avenging Monolith...



