Using Regular Expressions

Several Advanced Workflows Actions, notably the Find Text and Replace Text actions, allow the use of "regular expressions." A regular expression is a compact notation for describing patterns in text, which makes a wide range of searches and string manipulations possible. This article describes the regular expression syntax used in AWE for both searching and replacing text.

Searching for Text

Text can be found in a string using a regular expression by specifying a "Match expression". A Match expression operates on a single line of text at a time; no match can span multiple lines. Match expressions are composed of the following:

Period ('.') Matches any single character except newline. A
Caret (^) Matches at the beginning of a line only. A ^ occurring ANYWHERE in the match expression (except within a character class) is interpreted in this manner. This allows meaningful use of ^ in combination with grouping or alternation (see below).
Dollar sign ($) Matches at the end of a line only. As with ^ the $ character retains its special meaning anywhere within the expression (except in a character class).
Backslash (\) Followed by a single character matches that character. For example, '\*' matches an asterisk, '\\' matches a backslash, '\$' matches a dollar sign, etc.

The following sequences have special meaning:

\s space (ASCII #32)
\t tab (ASCII #9)
\b backspace (ASCII #8)
\r return (ASCII #13)
\l linefeed (ASCII #10)
\n newline (#13 followed by #10)
\p pipe character |
\w word delimiter. Matches any of \t\s!()*+,-./:;=?@[\]^`{|}~
\h hex character. Matches any of 0123456789ABCDEF

The special characters above should be used to produce instances of blanks and tabs. Case is ALWAYS significant when using the special characters. Thus \s matches a space while \S matches a capital letter S. A single character not otherwise endowed with special meaning matches that character. Thus z matches a single instance of the letter z.
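As a rough illustration, the sequences listed above can be mapped onto Python's `re` syntax. This is only a sketch: Python is not the engine AWE uses, and the names `AWE_ESCAPES` and `translate_escape` are hypothetical. The mapping matters because Python's own \s, \b and \n mean something different (for example, Python's \s matches any whitespace, while \s here is exactly one space). The \w word-delimiter class is left out of the sketch.

```python
import re

# Hypothetical mapping from the escapes listed above to Python regex text.
# Python's own \s, \b and \n differ in meaning, so each one is spelled out.
AWE_ESCAPES = {
    r"\s": " ",         # space (ASCII 32)
    r"\t": r"\t",       # tab (ASCII 9)
    r"\b": r"\x08",     # backspace (ASCII 8)
    r"\r": r"\r",       # return (ASCII 13)
    r"\l": r"\n",       # linefeed (ASCII 10)
    r"\n": r"\r\n",     # newline: return followed by linefeed
    r"\p": r"\|",       # literal pipe character
    r"\h": "[0-9A-F]",  # hex character, uppercase only as listed
}

def translate_escape(awe_escape: str) -> str:
    """Return Python regex text for one escape (illustrative only)."""
    return AWE_ESCAPES[awe_escape]

# \s stands for a single space here, so it must not match a tab.
space = re.compile(translate_escape(r"\s"))
print(bool(space.fullmatch(" ")), bool(space.fullmatch("\t")))
```

Because case is significant, the sketch has no entry for \S or \H; under the rules above those would simply stand for the capital letters S and H.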

A string enclosed in brackets [] specifies a character class. Any single character in the string is matched. For example, [abc] matches an a, b, or c. Ranges of ASCII letters and numbers can be abbreviated as, for example, [a-z0-9]. If the first symbol following the [ is a caret (^) then a negative character class is specified. In this case, the string matches all characters EXCEPT those enclosed in the brackets. For example, [^

The special characters defined above may be used inside of character classes with the exception of \n, \w and \h, which are shorthand for their own character classes. If the characters - or ] are to be used literally inside of a character class, they should be preceded by the escape character \. Note that *?+(){}!^$#& are not special characters when found inside a character class.
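The character class rules can be tried out in Python, whose `re` module happens to use the same bracket notation for the cases shown here. This is a sketch under that assumption, not a statement about how AWE implements classes; the helper name `chars_matching` is made up for the example.

```python
import re

def chars_matching(char_class: str, text: str) -> list:
    """Return every single character in text that the class matches."""
    return re.findall(char_class, text)

# [abc] matches a single a, b, or c.
print(chars_matching(r"[abc]", "cabbage"))

# A leading ^ negates the class: everything except a-z and 0-9.
print(chars_matching(r"[^a-z0-9]", "ab-12_z"))

# A literal - or ] inside a class is preceded by a backslash.
print(chars_matching(r"[\-\]]", "a-b]c"))

# Characters such as * and + lose their special meaning inside a class.
print(chars_matching(r"[*+]", "2*3+4"))
```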

Using Closures

A regular expression followed by * matches zero or more occurrences of that expression. This is referred to as a closure. Thus ba*b matches the string

A regular expression followed by + matches one or more occurrences of that expression. This is another type of closure. In this case ba+b will not match

A regular expression followed by ? matches zero or one occurrence of that expression. This is another closure. Here, ba?b will match
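The three closures can be compared side by side with a short Python sketch. The *, + and ? operators behave the same way in Python's `re` module for these simple patterns; the sample strings and the helper name `whole_line_matches` are chosen only for illustration.

```python
import re

candidates = ["bb", "bab", "baab"]

def whole_line_matches(pattern: str) -> list:
    """Return the candidate strings that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

for pattern in ("ba*b", "ba+b", "ba?b"):
    print(pattern, "->", whole_line_matches(pattern))
```

The * form accepts any number of a characters including none, the + form requires at least one, and the ? form accepts at most one.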

Concatenated Expressions

Two concatenated regular expressions match text that matches the first, immediately followed by text that matches the second. Thus (abc)(def) matches the string

Alternation

Two regular expressions separated by | match text that matches either the first or the second. This is referred to as alternation. Any number of regular expressions can be strung together in this way. Alternates are tested in order from left to right; the first one that matches is used, and the remaining alternates are skipped.
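The left-to-right rule means that the order of the alternates can change what is matched. The Python sketch below shows the effect; Python's `re` module also tries alternates from left to right, but it is only a stand-in for the AWE engine, and the helper name `first_alternative` is hypothetical. Each alternate is parenthesized so the grouping is explicit.

```python
import re

def first_alternative(pattern: str, text: str):
    """Return the text consumed when the pattern is applied at the start of text."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

# The shorter alternate comes first, so it wins and (abc) is never tried.
print(first_alternative(r"(ab)|(abc)", "abcd"))

# With the order reversed, the longer alternate is tried first.
print(first_alternative(r"(abc)|(ab)", "abcd"))
```

When one alternate is a prefix of another, listing the longer one first is usually the safer choice.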

Grouping Expressions

A regular expression enclosed in parentheses () matches whatever the enclosed expression matches. Parentheses are used to provide grouping, and may be nested to arbitrary depth. Open and close parentheses must be balanced. For example, the following two expressions are not equivalent, and the second probably expresses what was intended:

PROCEDURE|FUNCTION

(PROCEDURE)|(FUNCTION)

The first expression is equivalent to

PROCEDUR(E|F)UNCTION

The second expression matches either of the two words in their entirety.
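The difference between the two forms can be checked with a Python sketch. One caveat is an important assumption of the example: Python's `re` module gives | the lowest precedence, so an unparenthesized PROCEDURE|FUNCTION in Python already means "either word". To reproduce the reading described above, the sketch writes the first form out explicitly as PROCEDUR(E|F)UNCTION.

```python
import re

# The first expression, written out the way the text above says it is read.
first_form = re.compile(r"PROCEDUR(E|F)UNCTION")

# The second expression matches either word in its entirety.
second_form = re.compile(r"(PROCEDURE)|(FUNCTION)")

def accepted_by(regex, text: str) -> bool:
    """True when the whole text is matched by the compiled expression."""
    return regex.fullmatch(text) is not None

for text in ("PROCEDURE", "FUNCTION", "PROCEDURFUNCTION"):
    print(text, accepted_by(first_form, text), accepted_by(second_form, text))
```

The first form accepts only the run-together strings PROCEDUREUNCTION and PROCEDURFUNCTION, which is rarely what is wanted.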

Tagged Matches

A regular expression enclosed in curly braces {} forms a tagged match word. Whatever was matched within the braces may be referred to by a Replace expression in a manner to be described. Tagged match words may not be nested. Open and close braces must be balanced. A maximum of nine tagged match words can be referenced by the Replace expression. Note that the use of curly braces in Replace expressions is meaningless. However, these expressions share an expression interpreter with the Match expressions, so no exception is raised. For example, consider the expression

b{a*}b.

If the string being tested is 'bab', then the tagged match word contains a single 'a'. If the string being tested is '
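A tagged match word corresponds closely to a capturing group in Python, which makes the behavior easy to experiment with. In the sketch below the expression b{a*}b is translated by hand to b(a*)b, since Python uses parentheses rather than curly braces for capturing; the helper name `tagged_word` is hypothetical.

```python
import re

# b{a*}b translated to Python: the parentheses capture the tagged word.
tagged = re.compile(r"b(a*)b")

def tagged_word(text: str):
    """Return the tagged match word, or None when the text does not match."""
    m = tagged.fullmatch(text)
    return m.group(1) if m else None

for text in ("bab", "baaab", "bb"):
    print(text, "->", repr(tagged_word(text)))
```

Note that a tagged word can be empty: the closure a* is allowed to match zero characters, so the expression still matches but the tag holds nothing.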

Order of Precedence

Regular expressions are interpreted from left to right. The order of precedence of operators at the same parenthesis level is [], then *+?, then |, and then concatenation.

Tag braces are interpreted strictly from left to right and do not control precedence in any way. The first tagged match word found is given a tag of 1, the second a tag of 2, and so on up to a maximum tag of 9. The tag number that each word receives is based on when it is encountered in the line. If tags are skipped over as a result of alternation, then any remaining tags in a line receive shifted tag numbers. For example, consider the expression:

(FUNCTION)|({PROCEDURE})\s+{[^\s(]+}

If a line contains the word PROCEDURE then the word following PROCEDURE has a tag number of 2. If a line contains the word FUNCTION, then the word following FUNCTION has a tag number of 1. It is up to the user to take advantage of this behavior. Generally, it is good practice to surround an entire set of alternates with tag markers:

{(FUNCTION)|(PROCEDURE)}\s+{[^\s(]+}
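The shifting tag numbers can be simulated in Python. This is a sketch with several assumptions: the expressions are translated by hand (\s becomes a literal space, curly braces become capturing parentheses, and the alternation is wrapped in a non-capturing group to reflect the precedence rules above), and the helper `awe_style_tags` is a made-up name. Python numbers groups by their position in the pattern, so the helper drops groups that did not take part in the match to imitate numbering by encounter.

```python
import re

def awe_style_tags(pattern: str, line: str) -> list:
    """Return captured words in encounter order, skipping groups that were not used."""
    m = re.search(pattern, line)
    if m is None:
        return []
    return [g for g in m.groups() if g is not None]

# (FUNCTION)|({PROCEDURE})\s+{[^\s(]+}  : only PROCEDURE is tagged.
shifting = r"(?:FUNCTION|(PROCEDURE)) +([^ (]+)"

# {(FUNCTION)|(PROCEDURE)}\s+{[^\s(]+}  : the whole set of alternates is tagged.
stable = r"(FUNCTION|PROCEDURE) +([^ (]+)"

for line in ("PROCEDURE Init(x)", "FUNCTION Area(r)"):
    print(line, awe_style_tags(shifting, line), awe_style_tags(stable, line))
```

With the first pattern the routine name is the second tag on a PROCEDURE line but the first tag on a FUNCTION line. With the second pattern it is always the second tag, which is why tagging the entire set of alternates is the recommended form.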

Replacing Text

Replace regular expressions are constructed the same way as Match regular expressions, but the number of operators is reduced. The replacement process occurs in the following manner:

The Match expression finds a string of text that starts at the left-most position in the input line that matches, and continues to the right-most position that matches. The string of matched text is operated upon by the Replace expression. The Match expression is then tried again on the input, starting at the first position beyond the previous match string. This recurs until the end of line is found.
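The procedure described above can be written out as a short loop. The Python sketch below is an approximation for illustration only: it uses Python's `re` module to find each match, takes a plain replacement string rather than a Replace expression, and the function name `replace_line` is hypothetical.

```python
import re

def replace_line(match_expr: str, replacement: str, line: str) -> str:
    """Replace each match in turn, resuming just beyond the previous match."""
    regex = re.compile(match_expr)
    pieces = []
    pos = 0
    while pos <= len(line):
        m = regex.search(line, pos)
        if m is None:
            break
        pieces.append(line[pos:m.start()])  # unmatched text is copied through
        pieces.append(replacement)          # matched text is replaced
        if m.end() > m.start():
            pos = m.end()                   # first position beyond the match
        else:
            # Empty match: copy one character so the loop always advances.
            pieces.append(line[m.start():m.start() + 1])
            pos = m.end() + 1
    pieces.append(line[pos:])               # remainder of the line
    return "".join(pieces)

print(replace_line("a+", "X", "caaandaab"))
```

Each run of a characters is found once, from its left-most to its right-most position, and the search then continues with the text after it.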

Replace expressions are composed of the following:

No spaces: The regular expression may NOT contain any blank space. The special characters below should be used to produce instances of blanks, tabs and the null expression.
Null replace: If a null Replace expression is desired, the special symbol \z is used to indicate a null expression. Null Replace expressions are used to delete text strings.
Single character: A single character not otherwise endowed with special meaning.
Backslash (\): A '\' followed by a single character sends that character to the output. For example, '\&' writes an ampersand and '\\' writes a backslash.

The following sequences have special meaning:

\s space (ASCII #32)
\t tab (ASCII #9)
\b backspace (ASCII #8)
\r return (ASCII #13)
\l linefeed (ASCII #10)
\n newline (#13 followed by #10)
\z null expression

Unless a newline combination is explicitly matched in the Match expression, it is not necessary to explicitly specify

Another special case occurs when '\' is followed by a single digit in the range of 1 through 9. In this case the tagged match word found by the Match expression is sent to the output. If a tagged match word for that tag number was not defined, or if the tagged match word doesn't match anything, then nothing is output. The tagged match words can be output in any order and can be repeated any number of times.

An ampersand ('&') appearing in the Replace expression causes all text matched by the match expression to be sent to the output. The ampersand can appear in the Replace expression as many times as desired.
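The Replace rules above (special sequences, tag references and the ampersand) can be sketched as a small expander in Python. Everything named here (`REPLACE_ESCAPES`, `expand_replace`, `awe_replace`) is hypothetical, and the Match side is given directly as a Python pattern with capturing parentheses standing in for tagged match words. Python numbers groups by position in the pattern rather than by encounter, so the sketch is only faithful when no tags are skipped by alternation.

```python
import re

# Special sequences listed for Replace expressions, mapped to the text they emit.
REPLACE_ESCAPES = {"s": " ", "t": "\t", "b": "\x08", "r": "\r",
                   "l": "\n", "n": "\r\n", "z": ""}

def expand_replace(replace_expr: str, match) -> str:
    """Expand a Replace expression against a Python match object."""
    out = []
    i = 0
    while i < len(replace_expr):
        ch = replace_expr[i]
        if ch == "\\" and i + 1 < len(replace_expr):
            nxt = replace_expr[i + 1]
            if nxt in "123456789":
                # Tag reference: undefined or unmatched tags emit nothing.
                n = int(nxt)
                word = match.group(n) if n <= match.re.groups else None
                out.append(word or "")
            elif nxt in REPLACE_ESCAPES:
                out.append(REPLACE_ESCAPES[nxt])
            else:
                out.append(nxt)         # e.g. \& writes an ampersand
            i += 2
        elif ch == "&":
            out.append(match.group(0))  # all text matched by the Match expression
            i += 1
        else:
            out.append(ch)              # ordinary character
            i += 1
    return "".join(out)

def awe_replace(py_match_expr: str, replace_expr: str, line: str) -> str:
    """Apply the Replace expression to every match of the Python pattern."""
    return re.sub(py_match_expr, lambda m: expand_replace(replace_expr, m), line)

print(awe_replace(r"([a-z]+) ([a-z]+)", r"\2\s\1", "hello world"))
print(awe_replace(r"[0-9]+", "&&", "ab12cd"))
```

The first call swaps two words by writing the tags in reverse order with \s between them; the second doubles each number by using the ampersand twice.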

Examples:

Return the first letter of the input line:

^[A-Za-z]

Return everything in the line except for the first word:

\s.*

Return the number 123 from the text "This is the number 123":

[0-9]+
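Expressions for the three tasks above can be checked quickly in Python. This is a sketch, not the AWE engine: the helper name `first_match` is made up, the space escape \s is written as a literal space, and the patterns are the ones shown in the code. Two details are worth noting when comparing results: a class such as [A-B] covers only the letters A and B, and a bare [0-9] matches a single digit, so the sketch uses [A-Za-z] for "any letter" and [0-9]+ for a whole number.

```python
import re

def first_match(pattern: str, line: str):
    """Return the first text matched in the line, or None if nothing matches."""
    m = re.search(pattern, line)
    return m.group(0) if m else None

# First letter of the input line.
print(repr(first_match(r"^[A-Za-z]", "Hello big world")))

# Everything after the first word (the match begins with the separating space).
print(repr(first_match(r" .*", "Hello big world")))

# The number 123: + extends the match over all consecutive digits.
print(repr(first_match(r"[0-9]+", "This is the number 123")))
```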