Enhanced C#
Language of your choice: library documentation
|
Encapsulates LLLPG, the Loyc LL Parser Generator, which generates LL(k) recursive-descent parsers. More...
Encapsulates LLLPG, the Loyc LL Parser Generator, which generates LL(k) recursive-descent parsers.
LLLPG is a new LL(k) parser generator under the umbrella of the Loyc project (http://loyc.net).
LLLPG generates recursive-descent parsers for LL(k) grammars. It is designed for parsing computer languages, not natural languages. It also it supports "syntactic predicates" which are zero-width syntactic assertions, and "semantic predicates" which are arbitrary expressions.
The LLParserGenerator class is the core engine. It generates parsers in the form of a Loyc tree, which can be printed out as C# code. Look at the documentation of the Run() method for an overview of how the LLLPG engine works.
Note: the input to LLLPG is usually provided in the form of LES/EC# source code. In that case, there is no need to use this class directly. The source code of Program.Main shows how to invoke LLLPG as a macro via the LeMP.Compiler.
For more information about how to use LLLPG, read http://www.codeproject.com/Articles/664785/A-New-Parser-Generator-for-Csharp
Nested classes | |
class | ComputeNext |
Gathers a list of all one-token transitions starting from a single position. Also gathers any and-predicates that must be traversed before completing a transition. More... | |
class | GenerateCodeVisitor |
Directs code generation using the visitor pattern to visit the predicates in a rule. The process starts with Generate(Rule). More... | |
class | GetCanonical |
Computes the "canonical" interpretation of a position for prediction purposes, so that ConsolidateDuplicatePositions can detect duplicates reliably. Call Do() to use. More... | |
class | GrammarPos |
Represents a location in a grammar: a predicate and a "return stack" which is a so-called persistent singly-linked list. This type is used within Transition. More... | |
class | KthSet |
Represents the possible interpretations of a single input character, in terms of transitions in the grammar. More... | |
class | PredictionAnalysisVisitor |
Performs prediction analysis using the visitor pattern to visit the predicates in a rule. The process starts with Analyze(Rule). More... | |
class | Transition |
Represents a position in a grammar (GrammarPos) plus the set of characters that leads to that position from the previous position. This is a single case in a KthSet. More... | |
Public fields | |
int | DefaultK = 2 |
Specifies the default maximum lookahead for rules that do not specify a lookahead value. More... | |
bool | NoDefaultArm = false |
Normally, the last arm in a list of alternatives is chosen as the default. For example, in ("Foo" | "Bar"), the second branch is taken unless the input begins with 'F'. However, if this flag is true, there is no default arm on Alts unless one is specified explicitly, so a special error branch is generated when none of the alternatives apply. This increases code size and decreases speed, but the generated parser may give better error messages. More... | |
bool | FullLLk = false |
Enables full LL(k) instead of "partly approximate" lookahead. More... | |
int | Verbosity = 0 |
Gets or sets the verbosity level. Verbose output can help you debug grammars that don't produce the expected code. More... | |
bool | AddComments = true |
Whether to insert Specifies the default maximum lookahead for rules that do not specify a lookahead value. More... | |
bool | AddCsLineDirectives = true |
Whether to add #line directives for C# before and after user actions. More... | |
Properties | |
IMessageSink | Sink [get, set] |
Called when an error or warning occurs while parsing a grammar or while generating code for a parser. Also called to print "verbose" messages. More... | |
IPGCodeGenHelper | CodeGenHelper [get, set] |
Public Member Functions | |
LLParserGenerator (IPGCodeGenHelper csg, IMessageSink sink=null) | |
void | AddRules (params Rule[] rules) |
void | AddRules (IEnumerable< Rule > rules) |
void | AddRule (Rule rule) |
LNode | Run (ISourceFile sourceFile) |
Generates a braced block of code {...} for the grammar described by the rules that were previously added to this object with AddRule or AddRules(Rule[]). More... | |
Static Public Member Functions | |
static Dictionary< Symbol, Rule > | AddRulesToDict (IEnumerable< Rule > rules, Dictionary< Symbol, Rule > dict=null) |
Protected Member Functions | |
KthSet[] | ComputeFirstSets (Alts alts) |
KthSet[] | ComputeNextSets (List< KthSet > previous, Alts currentAlts) |
KthSet | ComputeNextSet (KthSet previous, bool addEOF) |
void | MakeCanonical (KthSet next) |
Protected fields | |
ISourceFile | _sourceFile |
WList< LNode > | _classBody |
IPGCodeGenHelper | _helper = new IntStreamCodeGenHelper() |
ComputeNext | _computeNext = new ComputeNext() |
GetCanonical | _getCanonical = new GetCanonical() |
Protected static fields | |
static Severity | Warning = Severity.Warning |
static Severity | Error = Severity.Error |
static Severity | Verbose = Severity.Verbose |
|
inline |
Generates a braced block of code {...} for the grammar described by the rules that were previously added to this object with AddRule or AddRules(Rule[]).
sourceFile |
[This may be outdated, TODO: review it]
By far the greatest difficulty in this process is generating prediction code when the grammar branches: (x | y | z
). Since this class creates LL(k) parsers without memoization or implicit backtracking, it relies on prediction trees to correctly decide in advance which branch to follow.
The following kinds of grammar elements require prediction:
a | b
(which is equivalent to a / b
): prediction chooses between a and b a?
: prediction chooses between a and whatever follows a? a*
: prediction chooses between a and whatever follows a* (a | b)*:
prediction chooses between three alternatives (a, b, and exiting the loop). (a | b)?:
prediction chooses between three alternatives (a, b, and skipping both). a+
: exactly equivalent to a a*
All of these are based on an Alts object.
Let's look at a simple example of the prediction code generated for a rule called "Foo":
By default, to make prediction more efficient, the last alternative is assumed to match if the others don't. So when a
doesn't match, b
is called even though it has not been verified to match yet. This behavior can be changed by setting NoDefaultArm=true.
Alternatively, you can select the default using the 'default' keyword, which controls the Alts.DefaultArm property, e.g.
In simple cases like this one that only require LL(1) prediction, prediction and matching are merged into a single if-else chain. In more complicated cases, goto statements may be used to avoid code duplication (ANTLR uses pairs of if-else or switch statements instead, but I chose to use gotos because the generated code will be faster.) The if-else statements are the "prediction" part of the code, while the calls to a() and b() are the "matching" part.
Here's another example:
A kleene star (*) always produces a "for(;;)" loop, while an optional item may produce a "do ... while(false)" pseudo-loop in some circumstances (but this case is too simple to require it). Here there are two separate prediction phases: one for the outer loop (a | b? 'c')*
, and one for b?
.
In this example, the loop appears at the end of the rule. In some such cases, the "follow set" of the rule becomes relevant. In order for the parser to decide whether to exit the loop or not, it may need to know what can follow the loop. For instance, if ('a' 'b')*
is followed by 'a'..'z' 'c', it is not possible to tell whether to stay in the loop or exit just by looking at the first input character. If LA(0) is 'a', it is necessary to look at the second character LA(1); only if the second character is 'b' is it possible to conclude that 'a' 'b' should be matched.
Therefore, before generating a parser one of the steps is to build the follow set of each rule, by looking for places where a rule appears inside other rules. A rule is not aware of its current caller, so it gathers information from all call sites and merges it together. When a rule is marked "public", it is considered to be a starting rule, which causes the follow set to include $ (which means "end of input").
The fact that LLLPG is aware of follow sets and the differences between alternatives, and the fact that its generated parsers do not normally backtrack, makes LLLPG's LL(k) parsing tecnique fundamentally different from another popular parsing technique, PEG. The documentation of LLParserGenerator explains further.
Here's an example that needs more than one character of lookahead:
Here, the prediction and matching phases are merged for the second alternative, but separate for the first alternative (because it is chosen in two different places in the prediction logic). Notice that the matching for alt 2 starts with Match()
twice, with no arguments, but is followed by MatchRange('a', 'z')
. This demonstrates communication from prediction to matching: the matching phase can tell that LA(0) is confirmed to be 'x', and LA(1) is confirmed to be '0'..'9', so an unconditional match suffices. However, nothing is known about LA(2) so its value must be checked, which is what MatchRange() is supposed to do.
In some cases, LA(0) is irrelevant. Consider this example:
Here, the first character of both alternatives is always '(', so looking at LA(0) doesn't help choose which branch to take, and prediction skips ahead to LA(1).
An and-predicate specifies an extra condition on the input that must be checked. Here is a simple example:
This example says that '0'..'9' is only allowed if the expression flag
evaluates to true, otherwise 'a'..'z' is required. LLPG, however, gives and-predicates lower priority, and always inverts the order of the testing: it checks for '0'..'9' first, then checks flag
afterward. I chose to make LLPG work this way because in general, and- predicates can be much more expensive to check than character sets; if one of the alternatives rarely runs, it would be wasteful to check an expensive and-predicate before checking if the input character could possibly match. Therefore, the generated code looks like this:
If you really need to make the and-predicate run first for some reason, I dunno. I got nothin'. Complain to me every month until I implement something, maybe.
A generated parser performs prediction in two interleaved parts: character-set tests, and and-predicate tests. In this example,
The code will look like this:
Here you can see the interleaving: first the parser checks LA(0), then it checks the and-predicate, then it checks LA(1).
LLPG (let's call it 1.0) does not support any analysis of the contents of an and-predicate. Thus, without loss of generality, these examples use semantic predicates &{...} instead of syntactic predicates &(...); LLPG can't "see inside them" either way.
Even without analyzing the contents of an and-predicate, they can still make prediction extremely complicated. Consider this example:
In this example, the first branch requires 'a' and 'd' to be true, there's a pair of zero-width alternatives that require 'b' or 'c' to be true, {B()} must be executed if 'b' is true, 'e' must be true if LA(0) is ('e'|'E'), 'f' must be true if LA(0) is 'f' and no condition is required for 'F'. The second branch also allows 'e' or 'f', provided that 'c' is true, but requires 'f' if LA(0) is 'e'.
LLLPG appears to handle this case correctly.
References Loyc.Localize.Localized().
bool Loyc.LLParserGenerator.LLParserGenerator.AddComments = true |
Whether to insert Specifies the default maximum lookahead for rules that do not specify a lookahead value.
bool Loyc.LLParserGenerator.LLParserGenerator.AddCsLineDirectives = true |
Whether to add #line directives for C# before and after user actions.
int Loyc.LLParserGenerator.LLParserGenerator.DefaultK = 2 |
Specifies the default maximum lookahead for rules that do not specify a lookahead value.
bool Loyc.LLParserGenerator.LLParserGenerator.FullLLk = false |
Enables full LL(k) instead of "partly approximate" lookahead.
LLLPG's standard disambiguation mode is similar to the "linear approximate" lookahead present in the ANTLR v2 parser generator. The original linear approximate lookahead fails to predict the following case correctly:
LLLPG has no problem with this case. However, LLLPG's "somewhat approximate" lookahead still has problems with certain cases involving nested alternatives. Here's a case that it can't handle:
Basically here's what goes wrong: LLLPG detects that both alternatives can start with 'a' or 'b'. The way it normally builds a prediction tree is by creating a test for the common set between two alternatives:
Then, inside that "if" statement it adds a test for LA(1). Sadly, LLLPG discovers that if (la1 == 'a' || la1 == 'b'), both alternatives still apply. Thus, it can't tell the difference between the two and gives up, picking the first alternative unconditionally and printing a warning that "Branch 2 is unreachable".
To fix this, LLLPG must figure out that it should split the LA(0) test into two separate "if" clauses. I've figured out how to do this, but the new code is experimental, it creates subtly different results than standard prediction, which causes the test suite to fail, it sometimes uses too many branches that are not merged properly, I suspect it might be substantially slower at code generation in some cases, and finally I am worried that it will make the generated code much larger sometimes (although I have not actually found or seen such a case).
So, full LL(k) is disabled by default, but you can enable it if you encounter a problem like this.
bool Loyc.LLParserGenerator.LLParserGenerator.NoDefaultArm = false |
Normally, the last arm in a list of alternatives is chosen as the default. For example, in ("Foo" | "Bar"), the second branch is taken unless the input begins with 'F'. However, if this flag is true, there is no default arm on Alts unless one is specified explicitly, so a special error branch is generated when none of the alternatives apply. This increases code size and decreases speed, but the generated parser may give better error messages.
When this flag is false, an error branch is still generated on a particular loop if requested with Alts.ErrorBranch.
int Loyc.LLParserGenerator.LLParserGenerator.Verbosity = 0 |
Gets or sets the verbosity level. Verbose output can help you debug grammars that don't produce the expected code.
Level 1 verbosity prints simplified prediction trees in each rule, and the follow sets of each rule. Level 2 verbosity prints prediction trees before they are simplified, and before they have been extended to handle unspecified cases (e.g. if your rule says 'a' 'b' | 'c' 'd', the unspecified cases are all other possible inputs.) Level 3 verbosity prints level 1 and 2 information.
|
getset |
Called when an error or warning occurs while parsing a grammar or while generating code for a parser. Also called to print "verbose" messages.
The parameters are (1) a Node that represents the location of the error, or Node.Missing if the grammar was created programmatically without any source code backing it; (2) a predicate related to the error, or null if the error is a syntax error; (3) "Warning" for a warning, "Error" for an error, or "Verbose"; and (4) the text of the error message.