LLLPG: greedy and nongreedy

This post was imported from blogspot.

(Edit: This blog post was incorporated into the fourth article about LLLPG.)

LLLPG supports "greedy" and "nongreedy" loops and optional items. The "greedy" and "nongreedy" modes refer to the action you prefer to take in case of ambiguity between an exit branch and another branch. Greedy is the default: it means that if the input matches both a non-exit branch and an exit branch, the non-exit branch should be taken. A typical greedy example is this rule for an "if" statement:
  private token IfStmt @[ 
    // "if" "(" Expr ")" Stmt ("else" Stmt)?
    TT.If TT.LParen Expr TT.RParen Stmt (greedy(TT.Else Stmt))?
  ];
Note: currently you need extra parens around greedy(...) or nongreedy(...) due to a parser bug, sorry about that. Without the parens you'll get the error "Unrecognized expression. Treating it as a terminal."

In this case, it is possible that the "if" statement is nested inside another "if" statement. Given that the input could be something like
if (expr) if (expr)
  Stmt();
else
  Stmt();
It is, in general, ambiguous whether to consume TT.Else Stmt or to exit, because the else clause could be paired with the first "if" or the second one. The "greedy" modifier, which must always be paired with a loop or option operator (* + ?) means "in case of ambiguity with the exit branch, do not exit and do not print a warning." Since greedy behavior is the default, the greedy modifier's only purpose is to suppress the warning.

Now, you might logically think that changing "greedy" to "nongreedy" would cause the "else" to match with the outer "if" statement rather than the inner one. Unfortunately, that's not what happens! It does not work because the code generated for IfStmt is not aware of the run-time call stack leading up to it: it does not know whether it is nested inside another IfStmt or not. LLLPG only knows that it could be nested inside another "if" statement; the technical jargon for this is that the follow set of the IfStmt rule includes TT.Else Stmt.

What actually happens is that nongreedy(TT.Else Stmt)? will never match TT.Else, and LLLPG will give you a warning that "branch 1 is unreachable". Not knowing the actual context in which IfStmt was called, LLLPG is programmed to assume that all possible follow sets of IfStmt apply simultaneously, even though in reality IfStmt is called in one specific context. The statically computed follow set of IfStmt, which is based on all possible contexts where IfStmt might appear, includes TT.Else Stmt, and nongreedy uses this information to decide, unconditionally, to let the exit branch win. To put it another way, LLLPG behaves as if IfStmt is always called from inside another IfStmt, when in reality it merely might be. It would be fairly difficult for LLLPG to behave any other way; how is the IfStmt() method supposed to know call stack of other rules that called it?

By the way, I have the impression that the formal way of describing this limitation of LLLPG's behavior is to say that LLLPG supports only "strong" LL(k) grammars, not "general" LL(k) grammars (this is true even when you use FullLLk(true)).

So at the end of a rule, LLLPG makes decisions based on all possible contexts of that rule, rather than the actual context. Consequently, nongreedy is not as useful as it could be. However, nongreedy still has its uses. Good examples include strings and comments:
  token TQString @[ "'''" (nongreedy(_))* "'''" ];
  token MLComment @[ "/*" (nongreedy(MLComment / _))* "*/" ];
This introduces the single underscore _, which matches any single terminal (not including EOF).

The first example defines the syntax of triple-quoted strings '''like this one'''. The contents of the string are any sequence of characters except "'''", which ends the string. The nongreedy modifier is important; without it, the loop (_)* will simply consume all characters until end of file, and then produce errors because the expected "'''" was not found at EOF.

The second example for /* multi-line comments */ is similar, except that (just for fun) I decided to support nested multi-line comments by calling the MLComment rule recursively.

There's actually a bug in TQString, assuming that LLLPG is left in its default configuration. Moreover, LLLPG will not print a warning about it. Got any idea what the bug is? I'm about to spoil the answer, so if you want to give it some thought, do so now before you start glancing at the lower half of this paragraph. Well, if you actually tested this code you might notice that a string like '''one''two''' will be parsed incorrectly, because two quotes, not three, will cause the loop to exit. The reason is that the default maximum lookahead is 2, so two quotes are enough to make LLLPG decide to exit the loop (and then the third Match('\'') in the generated code will fail). To fix this, simply add a [k(3)] attribute to the rule. No warning was printed because half the purpose of nongreedy is to suppress warnings; after all, mixing (_)* with anything else is inherently ambiguous and will frequently cause a warning that you must suppress.

Today I ran into an unfortunate situation in which neither greedy nor nongreedy was appropriate. I was writing a Visual Studio "classifier" for syntax-highlighting of LES, and I decided to use a line-based design where lexing would always start at the beginning of a line. Therefore, I needed to keep track of which lines started inside multi-line comments and triple-quoted strings. Now, if a line starts inside a comment or string, I invoke a special rule that is designed to parse the rest of the comment or string, or stop at the end of the line. Since LES supports nested multi-line comments, I wrote the following rule:
  public token MLCommentLine(ref nested::int)::bool @[ 
    (nongreedy
      ( &{nested>0} "*/" {nested--;}
      / "/*" {nested++;}
      / ~('\r'|'\n')
      ))*
    (Newline {return false;} | "*/" {return true;})
  ];
This rule takes the current comment nesting level as an argument (0 = comment is not nested) and updates the nesting level if it changes during the current line of code. The loop has three arms:
  1. For input of "*/" when comments are nested, reduce the nesting level
  2. For input of "/*", increase the nesting level
  3. For input of anything else (not including a newline), consume one character.
I chose "nongreedy" because otherwise the third branch ~('\r'|'\n') will match the first character of "*/", so the loop would never exit. But this didn't work; LLLPG gave the warning "branch 1 is unreachable". Why is it unreachable? I have to admit, I couldn't figure it out at first. If you feel like you're stumped by LLLPG warnings sometimes, you're not alone, they sometimes confuse me too. In this case I was confused because I thought the predicate &{nested>0} would choose whether to stay in the loop or exit. But in fact nongreedy gives the exit branch a higher priority than the first branch, so regardless of whether &{nested>0}, LLLPG will always choose the exit branch when the input is "*/".

At that point I realized that what I wanted was a loop that was neither greedy nor nongreedy, in which the priority of the exit branch is somewhere in the middle. I wanted to be able to write something like this, where "exit" is higher priority than ~('\r'|'\n') but lower priority than &{nested>0} "*/":
  public token MLCommentLine(ref nested::int)::bool @[ 
    ( &{nested>0} "*/" {nested--;}
    / "/*" {nested++;}
    / exit
    / ~('\r'|'\n')
    )*
    (Newline {return false;} | "*/" {return true;})
  ];
Unfortunately, LLLPG does not support this notation. Maybe in a future version. Here's what I did instead:
  public token MLCommentLine(ref nested::int)::bool @[ 
    (greedy
      ( &{nested>0} "*/" {nested--;}
      / "/*" {nested++;}
      / ~('\r'|'\n'|'*')
      / '*' (&!'/')
      ))*
    (Newline {return false;} | "*/" {return true;})
   ];
Here, I switched back to a greedy loop and added '*' as its own branch with an extra check to make sure '*' is not followed by '/'. If the test &!'/' succeeds, the new fourth branch matches the '*' character (but not the character afterward); otherwise the loop exits. I could have also written it like this, with only three branches:
  public token MLCommentLine(ref nested::int)::bool @[ 
    (greedy
      ( &{nested>0} "*/" {nested--;}
      / "/*" {nested++;}
      / (&!"*/") ~('\r'|'\n')
      ))*
    (Newline {return false;} | "*/" {return true;})
  ];
However, this version is slower, because LLLPG will actually run the &!"*/" test on every character within the comment.

Here's one more example using nongreedy:
// Parsing a comma-separated value file (.csv)
public rule CSVFile @[ Line* ];
rule Line           @[ Field greedy(',' Field)* (Newline | EOF) ];
rule Newline        @[ ('\r' '\n'?) | '\n' ];
rule Field          @[ nongreedy(_)*
                     | '"' ('"' '"' | nongreedy(~('\n'|'\r'))* '"' ];
This grammar describes a file filled with fields separated by commas (plus I introduced the EOF symbol, so that no Newline is required at the end of the last line). Notice that 'Field' has the loop nongreedy(_)*. How does LLLPG know to when to break out of the loop? Because it computes the "follow set" or "return address" of each rule. In this case, 'Field' can be followed by ','|'\n'|'\r'|EOF, so the loop will break as soon as one of these characters is encountered. This is different than the IfStmt example above in an important respect: Field always has the same follow set. Even though Field is called from two different places, the follow set is the same in both locations: ','|'\n'|'\r'|EOF. So nongreedy works reliably in this example because it makes no difference what context Field was called from.