Regular expression syntax in unix
"Regular expressions" are not a part of CSC 209, but since you are
being exposed to software tools which take them as arguments, you might want
to know the syntax.
So here's a quick summary of the syntax of the most basic aspects of regular
expressions, as implemented by many unix tools:
- most characters mean themselves
- a dot means any one character
- something followed by an asterisk (star) means zero or more of that thing
- character lists or ranges in square brackets match one character which is
any of those characters (examples: [a-z] matches any lower-case
letter; [xq] matches either 'x' or 'q';
this can be combined, as in [ac-z] which matches any lower-case
letter except 'b')
- inside the square brackets, the entire set of characters can be preceded
with '^' to complement the set; e.g. [^a-zA-Z0-9] matches any one
NON-alphanumeric character
- a backslash suppresses the special meaning of one following character,
e.g. \. means an actual dot
- parentheses can be used for grouping, but may need to be prefaced by
backslashes depending on the program (grep wants \( and \), egrep wants plain
parentheses); if so, the backslashes here turn ON the special meaning of the
parentheses
- "extended regular expressions" also permit the vertical bar, to indicate
alternatives. This is one of the differences between grep and egrep (see the
man pages).
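As a rough illustration of the rules above, here is a minimal sketch using
Python's "re" module. That choice is an assumption made for convenience:
Python's regular expressions are not the same flavour as grep's, but the basic
constructs listed above behave the same way, with parentheses and the vertical
bar special without backslashes (as in egrep). The helper name "matches" is
hypothetical and exists only for this example.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the WHOLE of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# most characters mean themselves
assert matches("cat", "cat")

# a dot means any one character (exactly one, not zero)
assert matches("c.t", "cut")
assert not matches("c.t", "ct")

# something followed by a star means zero or more of that thing
assert matches("ab*c", "ac")
assert matches("ab*c", "abbbc")

# lists and ranges in square brackets match one character from the set
assert matches("[xq]", "q")
assert matches("[ac-z]", "d")
assert not matches("[ac-z]", "b")

# a leading '^' inside the brackets complements the set
assert matches("[^a-zA-Z0-9]", "!")
assert not matches("[^a-zA-Z0-9]", "q")

# a backslash suppresses the special meaning of the dot
assert matches(r"a\.b", "a.b")
assert not matches(r"a\.b", "axb")

# grouping with parentheses, and alternatives with the vertical bar
assert matches("(ab)*", "ababab")
assert matches("cat|dog", "dog")
```

Every assertion passes silently. With grep itself, the last two patterns
would need egrep, or backslashed parentheses for the grouping.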
Regular expression notation is not to be confused with the much simpler (and
less powerful) "glob" notation in the shell for matching file names. In the
"glob" notation, an asterisk means zero or more of any characters; in the
regular expression notation we would write this as ".*", not "*".
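And "*" by itself is not a useful regular expression, since there is nothing
before it to repeat: depending on the program, it is either taken as a literal
asterisk (as in grep) or rejected as a syntax error.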
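To make the contrast between the two notations concrete, here is a small
sketch in Python. It assumes Python's "fnmatch" module as a stand-in for shell
globbing (it is only an approximation of what the shell does) and Python's
"re" module as a stand-in for a regular expression tool. The file names are
made up for the example.

```python
import fnmatch
import re

# Made-up file names, purely for illustration.
names = ["notes.txt", "notes.text", "txt", "a.txt.bak"]

# Glob notation: "*" on its own means zero or more of any characters,
# and "." is an ordinary character.
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, "*.txt")]

# Regular expression notation: the same idea is written ".*", and the
# literal dot before "txt" needs a backslash.
regex_hits = [n for n in names if re.fullmatch(r".*\.txt", n)]

# fnmatch.translate() returns the regular expression that Python itself
# derives from a glob pattern; it selects the same names.
translated = fnmatch.translate("*.txt")
translated_hits = [n for n in names if re.match(translated, n)]

# A lone "*" has nothing to repeat. Python's re module happens to reject
# it outright; other programs may behave differently.
try:
    re.compile("*")
    star_alone_accepted = True
except re.error:
    star_alone_accepted = False
```

Here all three lists contain only "notes.txt", and star_alone_accepted ends
up False.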