In a production, - binds one PEG clause to a non-terminal. Causes the - expression to be appanded to the far left or right, depend on where the - was. A non-terminal may only have two - operators: one before and one after.
When a non-terminal, call it N, with one or two - operators attached is reached during parsing a production, call it P, we call the expressions attached to the - operators E and proceed as follows:
Note that neither the original nor modified forms of N are evaluated as part of this process. To modify and then evaluate N, just do "E-N-E N".
To create a new, 'empty' alternative in N, just use "N-/&.".
These rules do not allow for a recursive use of the - operator, despite rules that do allow it being conceivable, because I don't think that's necessary and it adds a lot of complexity. You can always do something like:
A <- [some rule] / B B <- a-A-b
This was picked as an example because it is was a specific example of CFG not parseable by any PEG in the master's thesis that originally developed (the predecessor of) PEGs.
S <- (a / b) A !. A <- B C !. / C-(a / b) / (a / b) A B <- a C <- (a / b)
This is actually quite similar to the previous example. The second production is there solely to add elements to C.
No change after 'a' is seen. After 'ab' is seen:
S <- (a / b) A !. A <- B C !. / C-(a / b) / (a / b) A B <- a C <- (a / b) (a / b)
No change after 'aba', 'abaa' or 'abaab' is seen. After 'abaabb' is seen:
S <- (a / b) A !. A <- B C !. / C-(a / b) / (a / b) A B <- a C <- (a / b) (a / b) (a / b)
At this point, the "(a / b) A" expressions have already consumed 3 (a / b) characters, and we are back at "B C !." again. Continuing on, this grammar will successfully parse, without further modification, the remaining string if it consists of an 'a' followed by three (a / b) characters. Failure in this grammar can really only occur at string termination, or possible if a non-(a / b) character is introduced.
This was picked as an example because it is a canonical example of a non-CFG, but with this method can (I think) be parsed in linear time. Please not that the solution for this language in Bryan Ford's thesis is much better; this is just an example.
S <- a A !. A <- B C / B-b / C-c / a A B <- b C <- c
The goal here is for each initial 'a' in the input string to cause the activation of "B-b" and "C-c" exactly once each. Then the 'a' is eaten, and A is called again to try to match the whole string.
No change after 'a' is seen. After 'aa' is seen:
S <- a A !. A <- B C / B-b / C-c / a A B <- b b C <- c c
After 'aaa' is seen:
S <- a A !. A <- B C / B-b / C-c / a A B <- b b b C <- c c c
At this point, the "a A" productions have eaten 3 'a' characters and we are back at "B C". This grammar can only match 'aaabbbccc' unless more 'a' characters come, because only the presence of an 'a' can cause further expansion.
This is a classic example of a place where semantic analysis is required in language processing. Please forgive my sloppiness with quoting issues.
Start <- (Variable Spacing / Typedef Spacing)* Variable <- Type Spacing Identifier Spacing SEMI Identifier <- !Type !"typedef" [a-z][a-zA-Z0-9_]* Type <- "char" Spacing / "int" Spacing / UserTypes Spacing SEMI <- ';' Spacing <- ' '+ Typedef <- "typedef" Spacing UserTypes-/&. TypedefIdentifier Spacing SEMI TypedefIdentifier <- TypedefIdentifierChar TypedefIdentifierChar* TypedefIdentifierChar <- a UserTypes-a / b UserTypes-b / c UserTypes-c ... / z UserTypes-z / A UserTypes-A ... / Z UserTypes-Z / 0 UserTypes-0 ... / 9 UserTypes-9 UserTypes <- ()
The TypedefIdentifierChar production is a bit messy, but there's no way around it without extending the functionality of the - operator.
The target text is:
char foo; typedef mytype; mytype baz;
Nothing special happens in the parsing of the first line. Nothing special happens parsing 'typedef '. As the 'm' is being parsed through "UserTypes-/&.", the (partial) grammar looks like this:
Typedef <- "typedef" Spacing UserTypes-/&. TypedefIdentifier Spacing SEMI TypedefIdentifier <- TypedefIdentifierChar TypedefIdentifierChar* TypedefIdentifierChar <- a UserTypes-a ... / 9 UserTypes-9 UserTypes <- () / &.
Continuing with the parse of 'm' into TypeIdentifier, the partial grammar looks like this:
Typedef <- "typedef" Spacing UserTypes-/&. TypedefIdentifier Spacing SEMI TypedefIdentifier <- TypedefIdentifierChar TypedefIdentifierChar* TypedefIdentifierChar <- a UserTypes-a ... / 9 UserTypes-9 UserTypes <- () / &. m
Once the entirety of 'mytype;' has been parsed, the whole grammar looks like this:
Start <- (Variable Spacing / Typedef Spacing)* Variable <- Type Spacing Identifier Spacing SEMI Identifier <- !Type !"typedef" [a-z][a-zA-Z0-9_]* Type <- "char" Spacing / "int" Spacing / UserTypes Spacing SEMI <- ';' Spacing <- ' '+ Typedef <- "typedef" Spacing UserTypes-/&. TypedefIdentifier Spacing SEMI TypedefIdentifier <- TypedefIdentifierChar TypedefIdentifierChar* TypedefIdentifierChar <- a UserTypes-a ... / 9 UserTypes-9 UserTypes <- () / &. m y t y p e
This easily allows parsing of "mytype baz;".
The = operator is a modified version of the - operator, and is motivated by how unwieldy the TypedefIdentifierChar production of the last example is.
In a production, = binds the result of one PEG clause to a non-terminal. Usage is for counted repitition, essentially. Causes the addition to the body of the non-terminal with the result of the = expression appened to the far left or right, depending on where the = was. A non-terminal may only have two = operators: one before and one after.
When a non-terminal, call it N, with one or two - operators attached is reached during parsing a production, call it P, we call the expressions attached to the = operators E and proceed as follows:
Note that neither the original nor modified forms of N are evaluated as part of this process. To modify and then evaluate N, just do "E=N=E N".
To create a new, 'empty' alternative in N, just use "N=/&.".
These rules do not allow for a recursive use of the = operator, despite rules that do allow it being conceivable, because I don't think that's necessary and it adds a lot of complexity.
This is a classic example of a place where semantic analysis is required in language processing. Please forgive my sloppiness with quoting issues.
Start <- (Variable Spacing / Typedef Spacing)* Variable <- Type Spacing Identifier Spacing SEMI Identifier <- !Type !"typedef" [a-z][a-zA-Z0-9_]* Type <- "char" Spacing / "int" Spacing / UserTypes Spacing SEMI <- ';' Spacing <- ' '+ Typedef <- "typedef" Spacing UserTypes=/Identifier Spacing SEMI UserTypes <- ()
This works just like the example in the last section, but has a much cleaner definition.
Copyright (c) 2004, Robin Lee Powell. This text is available under the MIT License. The ideas herein you can do with what you like, but I'd appreciate attribution.