WARNING - I HAVE NO IDEA WHAT THIS IS

Seriously. I'm *guessing* that this is something I was planning to make It may even have been partially implemented in MOZ. But there's no real software here that I'm aware of.

Pego - Oz Parser Generator For PEGs

PEG Extension: The - Operator

In a production, - binds one PEG clause to a non-terminal. Causes the - expression to be appanded to the far left or right, depend on where the - was. A non-terminal may only have two - operators: one before and one after.

Algorithm

When a non-terminal, call it N, with one or two - operators attached is reached during parsing a production, call it P, we call the expressions attached to the - operators E and proceed as follows:

  1. Invalidate any stored values for the N production.
  2. There can optionally be a / character beside the - operator (i.e. a/-A-/a).
  3. If there is a second - operator attached to N, repeat the above for that operator.
  4. Treat the E-N-E term as a success and continue to the next element.

Note that neither the original nor modified forms of N are evaluated as part of this process. To modify and then evaluate N, just do "E-N-E N".

To create a new, 'empty' alternative in N, just use "N-/&.".

These rules do not allow for a recursive use of the - operator, despite rules that do allow it being conceivable, because I don't think that's necessary and it adds a lot of complexity. You can always do something like:

A <- [some rule] / B
B <- a-A-b

Example Of - Usage: The Grammar of S -> aSa | aSb | bSa | bSb | a

This was picked as an example because it is was a specific example of CFG not parseable by any PEG in the master's thesis that originally developed (the predecessor of) PEGs.

S <- (a / b) A !.
A <- B C !. / C-(a / b) / (a / b) A
B <- a
C <- (a / b)

This is actually quite similar to the previous example. The second production is there solely to add elements to C.

Parsing abaabba

No change after 'a' is seen. After 'ab' is seen:

S <- (a / b) A !.
A <- B C !. / C-(a / b) / (a / b) A
B <- a
C <- (a / b) (a / b)

No change after 'aba', 'abaa' or 'abaab' is seen. After 'abaabb' is seen:

S <- (a / b) A !.
A <- B C !. / C-(a / b) / (a / b) A
B <- a
C <- (a / b) (a / b) (a / b)

At this point, the "(a / b) A" expressions have already consumed 3 (a / b) characters, and we are back at "B C !." again. Continuing on, this grammar will successfully parse, without further modification, the remaining string if it consists of an 'a' followed by three (a / b) characters. Failure in this grammar can really only occur at string termination, or possible if a non-(a / b) character is introduced.

Example Of - Usage: The Grammar of ancncn

This was picked as an example because it is a canonical example of a non-CFG, but with this method can (I think) be parsed in linear time. Please not that the solution for this language in Bryan Ford's thesis is much better; this is just an example.

S <- a A !.
A <- B C / B-b / C-c / a A
B <- b
C <- c

The goal here is for each initial 'a' in the input string to cause the activation of "B-b" and "C-c" exactly once each. Then the 'a' is eaten, and A is called again to try to match the whole string.

Parsing aaabbbccc

No change after 'a' is seen. After 'aa' is seen:

S <- a A !.
A <- B C / B-b / C-c / a A
B <- b b
C <- c c

After 'aaa' is seen:

S <- a A !.
A <- B C / B-b / C-c / a A
B <- b b b
C <- c c c

At this point, the "a A" productions have eaten 3 'a' characters and we are back at "B C". This grammar can only match 'aaabbbccc' unless more 'a' characters come, because only the presence of an 'a' can cause further expansion.

Example Of - Usage: C-Style typedefs

This is a classic example of a place where semantic analysis is required in language processing. Please forgive my sloppiness with quoting issues.

Start <- (Variable Spacing / Typedef Spacing)*
Variable <- Type Spacing Identifier Spacing SEMI
Identifier <- !Type !"typedef" [a-z][a-zA-Z0-9_]*
Type <- "char" Spacing / "int" Spacing / UserTypes Spacing
SEMI <- ';'
Spacing <- ' '+

Typedef <- "typedef" Spacing UserTypes-/&. TypedefIdentifier Spacing SEMI
TypedefIdentifier <- TypedefIdentifierChar TypedefIdentifierChar*
TypedefIdentifierChar <- a UserTypes-a 
			    / b UserTypes-b
			    / c UserTypes-c
			    ...
			    / z UserTypes-z
			    / A UserTypes-A
			    ...
			    / Z UserTypes-Z
			    / 0 UserTypes-0
			    ...
			    / 9 UserTypes-9

UserTypes <- ()

The TypedefIdentifierChar production is a bit messy, but there's no way around it without extending the functionality of the - operator.

Parsing Example

The target text is:

char foo;
typedef mytype;
mytype baz;

Nothing special happens in the parsing of the first line. Nothing special happens parsing 'typedef '. As the 'm' is being parsed through "UserTypes-/&.", the (partial) grammar looks like this:

Typedef <- "typedef" Spacing UserTypes-/&. TypedefIdentifier Spacing SEMI
TypedefIdentifier <- TypedefIdentifierChar TypedefIdentifierChar*
TypedefIdentifierChar <- a UserTypes-a 
			    ...
			    / 9 UserTypes-9

UserTypes <- () / &.

Continuing with the parse of 'm' into TypeIdentifier, the partial grammar looks like this:

Typedef <- "typedef" Spacing UserTypes-/&. TypedefIdentifier Spacing SEMI
TypedefIdentifier <- TypedefIdentifierChar TypedefIdentifierChar*
TypedefIdentifierChar <- a UserTypes-a 
			    ...
			    / 9 UserTypes-9

UserTypes <- () / &. m

Once the entirety of 'mytype;' has been parsed, the whole grammar looks like this:

Start <- (Variable Spacing / Typedef Spacing)*
Variable <- Type Spacing Identifier Spacing SEMI
Identifier <- !Type !"typedef" [a-z][a-zA-Z0-9_]*
Type <- "char" Spacing / "int" Spacing / UserTypes Spacing
SEMI <- ';'
Spacing <- ' '+

Typedef <- "typedef" Spacing UserTypes-/&. TypedefIdentifier Spacing SEMI
TypedefIdentifier <- TypedefIdentifierChar TypedefIdentifierChar*
TypedefIdentifierChar <- a UserTypes-a 
			    ...
			    / 9 UserTypes-9

UserTypes <- () / &. m y t y p e

This easily allows parsing of "mytype baz;".

PEG Extension: The = Operator

The = operator is a modified version of the - operator, and is motivated by how unwieldy the TypedefIdentifierChar production of the last example is.

In a production, = binds the result of one PEG clause to a non-terminal. Usage is for counted repitition, essentially. Causes the addition to the body of the non-terminal with the result of the = expression appened to the far left or right, depending on where the = was. A non-terminal may only have two = operators: one before and one after.

Algorithm

When a non-terminal, call it N, with one or two - operators attached is reached during parsing a production, call it P, we call the expressions attached to the = operators E and proceed as follows:

  1. Invalidate any stored values for the N production.
  2. Speculatively attempt to evaluate E at the current position (i.e. as though "&E" was present), storing the string matched by E.

Note that neither the original nor modified forms of N are evaluated as part of this process. To modify and then evaluate N, just do "E=N=E N".

To create a new, 'empty' alternative in N, just use "N=/&.".

These rules do not allow for a recursive use of the = operator, despite rules that do allow it being conceivable, because I don't think that's necessary and it adds a lot of complexity.

Example Of - Usage: C-Style typedefs

This is a classic example of a place where semantic analysis is required in language processing. Please forgive my sloppiness with quoting issues.

Start <- (Variable Spacing / Typedef Spacing)*
Variable <- Type Spacing Identifier Spacing SEMI
Identifier <- !Type !"typedef" [a-z][a-zA-Z0-9_]*
Type <- "char" Spacing / "int" Spacing / UserTypes Spacing
SEMI <- ';'
Spacing <- ' '+

Typedef <- "typedef" Spacing UserTypes=/Identifier Spacing SEMI

UserTypes <- ()

This works just like the example in the last section, but has a much cleaner definition.

Copyright (c) 2004, Robin Lee Powell. This text is available under the MIT License. The ideas herein you can do with what you like, but I'd appreciate attribution.