This code is released into the public domain; you may do anything you wish with it.
The regular-expression engine here is inspired by the one in the book “Software Tools in Pascal” by B.W. Kernighan and P.J. Plauger, Addison-Wesley, 1981. (ISBN 0-201-10342-7). It was originally written in PDP-11 assembly and Pascal (in 1984) to run on a Terak 8510a running UCSD Pascal; the Pascal files were then converted to a homebrew language called P2 for use on a Terak 8510a running a much-modified version of UCSD Pascal. In 1985, it was rewritten in 68000 assembly language and C for use on the Atari ST; modifications, tweaks, and enhancements have been applied over the years since then. The current version will probably run on any 680x0 machine that has a C compiler and a motorola-format 680x0 assembler on it.
Installing the regex library on your system is fairly simple;
move libregex.a
into your library directory and move regex.h
into your include-file directory.
char *compile(char *pattern) | - compile regular expressions |
char *execute(char *start, char *end) | - match strings with compiled regular expressions |
extern char *RElastp; | - pointer at last char matched |
extern char *REargs[]; | - arguments array |
This is a simple regular-expression compiler/interpreter inspired
by the one in Kerningham & Plauger’s Software Tools in
Pascal. To use it, you feed compile()
a regular expression,
then execute()
each text object you want to do searches on.
Compile()
understands regular expressions containing the
following tokens:
=$
=
matches end of line (end of text or \n
)
=^
=
matches start of line (start of text or previous char is \n
)
=.
=
any character
=[
xxx]
=
characters matching or not matching a set of characters. [abc]
,
for example, matches a single a, b, or c. [a-z]
matches any
lower- case alphabetic (to match a literal -, it must be the
first or last character in the set. If the first character in
the set is a ^
, it means match anything except characters found
in the set; the only way you can match a ]
character is to have
it be the first character in the set (to match ]
and -
, you’d
need to do []-]
)
=*
=
match the previous token 0 or more times. Doesn’t work on ^
,
$
, *
, or +
, though.
=+
=
match the previous token 1 or more times.
=\<
=
matches the beginning of a word.
=\>
=
matches the end of a word.
=\(
=
indicates the beginning of a text argument, Arguments may not
be nested, and there are only 10 arguments allowed.
=\)
=
indicates the end of a text argument.
=\
digit=
matches a previously matched argument (see \(
and \)
)
=\
=
removes special meaning from all other tokens (for example,
\*
matches a *
.)
={
=
introduces an arbitrary closure. An arbitrary closure is defined
like {
minumum,
maximum}
and matches the previous token at
least minumum times but not more than maximum times. (For
example, to match the previous token zero or more times, use
{0,0}. Note that this is the *
token – for +
, it’s {1,0} -
using 0 for the maximum means it can match an infinite number
of times.)
If compile()
successfully compiles the string you give it, it
will return a null pointer; otherwise it will return a string
describing the reason that it couldn’t compile the expression.
The compiled pattern can be found in the string _REpattern
, which
is a pascal-style string (1 word length, then arbitrary data out
to that length.)
Execute()
tries to match the last pattern fed to compile()
against whatever text string you feed it. It doesn’t care what
the format of that string is, just that you give it a pointer to
the start and end of the string. If it finds a match to the pattern
in the string, it returns the start of the pattern; otherwise it
returns a null pointer.
On a successful match, execute() fills in a couple of globals -= it fills in RElastp with a pointer at the last character matched and REargs with pointers to the start and end of each argument contained in the pattern.
To match “hello” on stdin:
char *msg;
char line[200];
char *args[20]; /* 10 arguments, 2 pointers each */
char *p;
msg = compile("hello");
if (msg != (char *)0) {
fprintf(stderr, "RE: %s\n", msg);
exit(0);
}
while (gets(line))
if (p=execute(line, line+strlen(line)-1))
puts(line);
jessica l. parsons (Orc), 25-Mar-1991
=execute.s
=
The regular expression scanner (68k assembly language.)
=regex.h
=
library headerfile.
=re_token.h
=
headerfile for compile.c
=compile.c
=
the regular-expression compiler.
=gre.c
=
a simple grep for seeing how it works.
=makefile
=
a makefile for it.