This page has moved to ecsharp.net.

LLLPG Part 1: A new parser generator for C#

7 Oct 2013 (updated 6 Mar 2016). Originally published on CodeProject.

Introduction

LLLPG (Loyc LL(k) Parser Generator) is a recursive-decent parser generator for C#, with a feature set slightly better than ANTLR version 2.

In this article I assume you already know what parsers and lexers are; if not, click the links. If you haven’t written parsers before, article #2 will fill in your knowledge.

LLLPG is a system that I decided to create after trying to use ANTLR3’s C# module, and running into C#-specific bugs that I couldn’t overcome. Besides, I wasn’t happy with the ANTLR-generated code; I thought I could generate simpler and more efficient code. “How hard could it be to make an LL(k) parser generator?” I wondered. The answer: pretty damn hard, actually.

It took me a bit longer to make LLLPG than I intended (what, 5 years?), but… better late than never, right? While ANTLR has advanced some ways in that time period, it is still Java-centric, and I think the advantages of LLLPG still make it worth considering even if all the major C#-specific bugs in ANTLR have been fixed (I don’t know if they have or not, but the C# version still lags behind the Java version).

Typically, you will use the LLLPG Visual Studio Custom Tool (a.k.a. Single-File Generator):

LLLPG in Visual Studio

LLLPG is not a dedicated tool the way ANTLR is. Instead, LLLPG is designed to be embedded inside another programming language. While you may use LLLPG similarly to other parser generators, it’s really just a “macro” inside a programming language I’m making called Enhanced C# — one of a hundred macros that you might be using, and perhaps in the future you’ll write a macro or two yourself.

As of 2016, Enhanced C# is incomplete; only two components of it are ready (the parser, and the macro runner which is called LeMP). Hopefully though, you’ll find it fairly user-friendly and fun.

Other design elements of LLLPG include:

“Blah, blah, blah! Show me this thing already!”

Okay, let’s get started! Here’s a really simple example (EC#):

LLLPG (lexer) {
    public rule Digit @{ '0'..'9' };
    public rule Integer @{ Digit+ };
};

And the output is:

public void Digit()
{
    MatchRange('0', '9');
}
public void Integer()
{
    int la0;
    Digit();
    for (;;) {
      la0 = LA0;
      if (la0 >= '0' && la0 <= '9')
        Digit();
      else
        break;
    }
}

That’s it! So here’s some relevant facts to learn at this point (I love bulleted lists, don’t you? I wish people would use more of them!):

Another example

My next example is almost useful: a simple lexer.

using System;

enum TT { 
	Integer, Identifier, Operator
}
LLLPG (lexer) {
    private token Letter @{ ('a'..'z'|'A'..'Z') };
    private token Id  @{ (Letter|'_')
                         (Letter|'_'|'0'..'9')* };
    private token Int @{ '0'..'9'+ };
    private token Op  @{ '+'|'-'|'*'|'/' };
    public token TT Token @
      {  Op  {return TT.Operator;}
      |  Id  {return TT.Identifier;}
      |  Int {return TT.Integer;}
      };
};

In this example, using System and the enum are completely standard C# code and it will be printed out unchanged (well, it’ll be reformatted. Whitespaces and comments are not preserved.) The rest is a mixture of Enhanced C# and LLLPG code.

Unlike ANTLR, which generates text output and treats actions in braces {...} like plain text, LLLPG fully parses its input, and generates output in the form of a Loyc tree, not text. An independent library is in charge of formatting that Loyc tree as C# code (I welcome volunteers to write output libraries for other languages such as C++ and Java. You won’t just be helping LLLPG itself, but the entire Loyc project! Let me know if you’re interested.)

Here’s the generated code:

using System;
enum TT
{
    Integer, Identifier, Operator
}
void Letter()
{
    Skip();
}
void Id()
{
    int la0;
    la0 = LA0;
    if (la0 >= 'A' && la0 <= 'Z' || la0 >= 'a' && la0 <= 'z')
      Letter();
    else
      Match('_');
    for (;;) {
      la0 = LA0;
      if (la0 >= 'A' && la0 <= 'Z' || la0 >= 'a' && la0 <= 'z')
        Letter();
      else if (la0 == '_')
        Skip();
      else if (la0 >= '0' && la0 <= '9')
        Skip();
      else
        break;
    }
}
void Int()
{
    int la0;
    MatchRange('0', '9');
    for (;;) {
      la0 = LA0;
      if (la0 >= '0' && la0 <= '9')
        Skip();
      else
        break;
    }
}
void Op()
{
    Skip();
}
public TT Token()
{
    int la0;
    la0 = LA0;
    switch (la0) {
    case '*':
    case '+':
    case '-':
    case '/':
      {
        Op();
        return TT.Operator;
      }
      break;
    default:
      if (la0 >= 'A' && la0 <= 'Z' || la0 == '_' ||
                la0 >= 'a' && la0 <= 'z') {
        Id();
        return TT.Identifier;
      } else {
        Int();
        return TT.Integer;
      }
      break;
    }
}

This example demonstrates some new things:

The LeMP processing model

If you’re only interested in the parser generator, please skip this section, because right now I’d like to discuss the fancy technology that LLLPG is built on. In fact, you can skip most of the rest of the article and go straight to part 2.

So here’s the deal. I designed a language called Enhanced C#. It’s supposed to be about 99.7% backward compatible with C#, and the parser is about 95% complete (LINQ support is missing, but C# 6 syntax is available.) There is no EC# compiler yet, but there is a printer; instead you use the parser + LeMP + printer and feed the output to the plain C# compiler. With a few lines of code, you can parse a block of EC# code and print it out again:

using (LNode.PushPrinter(Ecs.EcsNodePrinter.Printer))
using (ParsingService.PushCurrent(Ecs.Parser.EcsLanguageService.Value))
{
    string input = @"{ your_code_here(Any_EC#_code_will_work); }";
    LNode code = ParsingService.Current.ParseSingle(input, MessageSink.Console);
    string output = code.ToString();
}

Since EC# is a superset of C#, LLLPG is able to produce C# code by using the EC# printer, as long as it only uses the C# subset of the language.

Earlier you saw a screen shot of LINQPad in which the input to LLLPG was LES code. LES is not a programming language, it is just a syntax and nothing else. One of the core ideas of the Loyc project is to “modularize” programming languages into a series of re-usable components. So instead of writing one big compiler for a language, a compiler is built by mixing and matching components. One of those components is the Loyc tree (the LNode class in Loyc.Syntax.dll). Another component is the LES parser (which is a text representation for Loyc trees). A third component is the EC# parser, a fourth component is the EC# printer, and a fifth component is LeMP, the macro processor.

Macros

Macros are a fundamental feature of LISP that I am porting over to the wider world of non-LISP languages.

A macro (in the LISP sense, not in the C/C++ sense) is simply a method that takes a syntax tree as input, and produces another syntax tree as output. Here’s an example of a macro:

[LexicalMacro("Name::Type",
  "Defines a variable or field in the current scope.", "#::")]
public static LNode ColonColon(LNode node, IMacroContext sink)
{
    var a = node.Args;
    if (a.Count == 2) {
        LNode r = node.With(CodeSymbols.Var, a[1], a[0]);
        r.BaseStyle = NodeStyle.Statement;
        return r;
    }
    return null;
}

This macro is part of the “LES Prelude class” for LeMP. Its job is to take a LES “variable declaration” such as x::Foo and change it into a different tree that the C# language service understands: #var(Foo, x), which represents Foo x;.

The input, x::Foo, is represented as a call to #:: with two arguments, x and Foo. ColonColon() is designed to transform the call to “#::”. It checks that there are two arguments, and swaps them while changing the call target from #:: to S.Var, which is an alias for #var. The C# node printer considers #var(type, name) to represent a variable declaration, and prints it with the more familiar syntax “type name”.

The point is, LLLPG is defined as a “macro” that takes your LLLPG (lexer) { ... }; or LLLPG (parser) { ... }; statement as input, and returns another syntax tree that represents C# source code. As a macro, it can live in harmony with other macros like the ColonColon macro.

Bootstrapping LLLPG

In order to allow LLLPG to support EC#, I needed a EC# parser. But how would I create a parser for EC#? Obviously, I wanted to use LLLPG to write the parser, but without any parser there was no easy way to submit a grammar to LLLPG! After writing the LLLPG core engine and the EC# printer, here’s what I did to create the EC# parser:

  1. I used C# operator overloading and helper methods as a rudimentary way to write LLLPG parsers in plain C# (example test suite).
  2. Writing parsers this way is very clumsy, so I decided that I couldn’t write the entire EC# parser this way. Instead, I designed a new language that is syntactically much simpler than EC#, called LES. This language would serve not only as the original input language for LLLPG, but as a general syntax tree interchange format—a way to represent syntax trees of any programming language: “xml for code”.
  3. I wrote a lexer and parser for LES by calling LLLPG programmatically in plain C# (operator overloading etc.)
  4. I wrote the [MacroProcessor][16] (which I later named “LeMP”, short for “Lexical Macro Processor”) and a wrapper class called [Compiler][17] that provides the command-line interface. MacroProcessor’s job is to scan through a syntax tree looking for calls to “macros”, which are source code transformers (more on that below). It calls those transformers recursively until there are no macro calls left in the code. Finally, Compiler prints the result as text.
  5. I built a small “macro language” on top of LES which combines LeMP (the macro processor) with a set of small macros that makes LES look a lot like C#. The macros are designed to convert LES to C# (you can write C# syntax trees directly in LES, but they are a bit ugly.)
  6. I wrote some additional macros that allow you to invoke LLLPG from within LES.
  7. I hacked the LES parser to also be able to parse LLLPG code like @{ a* | b } in a derived class (a shameful abuse of “reusable code”).
  8. I wrote a lexer and parser for LES in LES itself.
  9. I published this article in (Oct 2013).
  10. I wrote the lexer and parser of EC# in LES, in the process uncovering a bunch of new bugs in LLLPG (which I fixed)
  11. I added one extra macro to support LLLPG in EC#.
  12. Finally, to test the EC# parser, I rewrote the grammar of LLLPG in EC#.
  13. I updated this article (Feb 2014).

At long last the bootstrapping is complete, so you can write LLLPG parsers in EC#!

LLLPG’s input languages: EC# & LES

Enhanced C# (EC#)

As I mentioned, Enhanced C# is a language based on C# whose compiler doesn’t exist yet (I’m looking for volunteers to help build the compiler, stay tuned.) The parser does exist, though, so I can talk about some of the new syntax that EC# supports. Actually there is quite a bit of new syntax in EC#; let me just tell you about the syntax that is relevant to LLLPG.

Token literals

Loyc.Syntax.Lexing.TokenTree eightTokens = @{
    This token tree has eight children
    (specifically, six identifiers and two parentheses.
     The tokens inside the parentheses are children of the opening '('.)
};

LLLPG is a “domain-specific language” or DSL, which means it’s a special-purpose language (for creating parsers).

Token trees are a technique for allowing DSLs (Domain-Specific Languages) without any need for “dynamic syntax” in the host language. Unlike some “extensible” languages, EC# and LES do not have “extensible” parsers: you cannot add new syntax to them. However, EC# and LES do have token literals, which are collections of tokens. After a source file has been parsed, a macro can run its own parser on a token tree that is embedded in the larger syntax tree. That’s what LLLPG does.

EC# allows token trees in any location where an expression is allowed. It also allows you to use a token tree instead of a method body, or instead of a property body. So when the EC# parser encounters statements like these:

rule Bar @{ "Bar" };
rule int Foo(int x) @{ 'F' 'o' 'o' {return x;} };

The parser actually sees these as property or method declarations. LLLPG’s ECSharpRule macro then transforms them into a different form, shown here:

#rule(void, Foo, (), "Bar");
#rule(int, Foo, (#var(int, x),), ('F', 'o', 'o', { return x; }));

The main LLLPG macro is in charge of turning this into the final output:

void Bar()
{
  Match('B');
  Match('a');
  Match('r');
}
int Foo(int x)
{
  Match('F');
  Match('o');
  Match('o');
  return x;
}

Note: you may also see token literals with square brackets @[...]. This means the same thing as @{...}; there are two syntaxes for a historical reason, as explained in the next article.

Block-call statements

get { return _foo; }
set { _foo = value; }

In C#, you may have noticed that “get” and “set” are not keywords. Not unless they are inside a property declaration, that is. In C#, they “become” keywords based on their location.

EC# generalizes this concept. In EC#, get and set are never keywords no matter where they appear. Instead, get {...} and set {...} are merely examples of a new kind of statement, the block-call statement, which has two forms:

identifier { statements; }                  identifier (argument, list) { statements; }

These statements are considered exactly equivalent to method calls of the following form:

identifier({ statements; });                 identifier(argument, list, { statements; });

So the LLLPG block:

LLLPG (parser(laType(TokenType), matchType(int))) {
   rule Foo @{ ... };
}

Is a block-call statement, equivalent to

LLLPG (parser(laType(TokenType), matchType(int)), {
   rule Foo @{...};
});

Blocks as expressions

string fraction = "0.155";
float percent = 100 * { if (!float.TryParse(fraction, out float tmp)) tmp = -1; tmp };

In EC#, {braced blocks} can be used as expressions, which explains what a method call like foo(x, { y(z) }) means. When a block is used inside an expression, like this, the final statement in the block becomes the return value of the block as a whole, when there is no semicolon on the end of the final statement. In this case, tmp is the result of the block, so percent will be set to 15.5. Or rather, it will be set to 15.5 after the EC# compiler is written. Until then, this is merely an interesting but useless syntax tree.

In the case of a statement like

LLLPG (parser(laType(TokenType), matchType(int)), { rule Foo @{...}; });

Everything in parenthesis is passed to a macro belonging to LLLPG, which (to make a long story short) transforms it into C# code.

That’s enough information to understand how LLLPG works. Hopefully now you understand the concept of LLLPG as a DSL embedded in EC#.

Loyc Expression Syntax (LES)

LES is very comparable to EC#, especially its lexer (e.g. "strings", 'c'haracters, and @identifiers are practically the same between the two languages). It has

Important: All LES-based parsers now need to have the following first line:

import macros LeMP.Prelude.Les;

Originally this was not required because the LES “prelude” macros were imported automatically. However, the LES prelude could potentially interfere with normal C# code, so it is no longer imported automatically (the macro compiler doesn’t know anything about the input language, so it is unaware of whether it should import the macros or not).

When LLLPG was first released, you had to use LES, so I wrote the following section which describes the relationship between LES code and C# code. If you want, you can still write parsers in LES, but of course most readers will prefer EC#. If you are using “LeMP” instead of “LLLPG” as your single-file generator, add the line

import macros Loyc.LLPG;

at the top of your file to gain access to LLLPG.

Differences & similarities between LeMP/LES and C#

I don’t want to bore you with all the details, but the most striking difference between LES and C# is that LES has no keywords whatsoever. Words like if and while are not parsed in a special way because of the actual word used, but because of the how the statement is formatted.

Anyway, here’s a list:

Similarities:

The following statements mean the same thing in LeMP/LES and C#:

return;
return x;
continue;
break;
var x = value;                  x = y++ * 10;                                   Console.Write("Hi");

Differences:

In many cases the difference is simply that you need an extra semicolon or braces.

using System.Collections.Generic;  import System.Collections.Generic;
class Foo {   public Foo() : this(0) { }
  public Foo(int num) { }
}
class Foo {   public cons Foo() { this(0); };
  public cons Foo(num::int) { };
};

class Foo : BaseClass, IEnumerable { }  class Foo(BaseClass, IEnumerable!object) { };
int Square(int x)       { return x*x; }  def Square(x::int)::int { return x*x; };
void Quit() { throw new QuitException(); }    def Quit()  { throw (new QuitException()); };
int x, y; string s;                 x::int; y::int; s::string;
int[] list1;      List list2;  list1::array!int; list2::List!int;
int? maybe = null;       bool flag = true;
maybe::opt!int = @null;  flag::bool = @true;

int more = (int)(num * 1.5);
more::int = (num * 1.5) -> int;

while (x) Foo();      while (x) { Foo(); }; while x   { Foo(); };
do x/=2; while (x>y);     do { x/=2; } while (x>y);
for (x = 0; x < 10; x++) Console.WriteLine(x);      for (x = 0, x < 10, x++) { Console.WriteLine(x); };
foreach (var item in list) { Process(item); }  foreach (item in list)    { Process(item); };
if (c) return 0;      if (c) { return 0; }; if c   { return 0; };
if (list.Count > 0) str.Append(list[i]);         unless list.Count == 0 { str.Append(list[i]); };
if (inc) x++; else x--;      if (inc) {x++;} else {x--;}; if inc   {x++;} else {x--;};
switch (x) { case 123: break; default: break; }          switch (x) { case 123; break; default; break };          switch (x) { case 123 { break; }; default { break; }; };
try { } catch (Exception ex) { } finally { }; try { } catch ex::Exception  { } finally { };

In fact, in many of these cases the braces are not actually required, but you should include them as long as you don’t fully understand how LES works. Many error messages that don’t mention a semicolon are actually referring to a missing semicolon; LeMP gets confused because code like

if (a) { ... } if (b) { ... }

is parsed successfully as a single statement if (a, {...}, if, b, {...}), which the “if” macro does not understand because there are too many arguments.

Wrapping up

Obviously, I’ve just scratched the surface here, but this is almost enough for an introduction. As a next step, you can try out the demo lexer and parser included with this article.

The bug in the SFG custom tool (LllpgForVisualStudio.exe 1.0) that involved this weird exception courtesy Microsoft:

InvalidCastException: Unable to cast COM object of type ‘System.__ComObject’ to interface type ‘Microsoft.VisualStudio.Shell.Interop.IVsGeneratorProgress’. This operation failed because the QueryInterface call on the COM component for the interface with IID ‘{BED89B98-6EC9-43CB-B0A8-41D6E2D6669D}’ failed due to the following error: No such interface supported (Exception from HRESULT: 0x80004002 (E_NOINTERFACE)).

has been fixed in 1.1.0, so error messages will appear in the normal error list instead of on message boxes.

For more articles about LLLPG, return to the home page. Here’s a list of topics that are covered in future articles:

So that’s it! Hope you like my parser generator, folks, and I’ll be happy to answer your questions. But you probably don’t have questions. Because I pretty much covered everything.

History