Parsing a Loyc language with LLLPG, part 1
Introducing LeMP 1.3
Today I’ll start by giving you a recipe for parsing programming languages with LLLPG. But first I’d like to introduce a couple of other macros that come with LeMP (the macro processing engine that LLLPG runs on top of).
-
replace
is a macro that replaces all instances of some pattern with some other pattern. For example,/// Input replace (MB => MessageBox.Show, FMT($fmt, $arg) => string.Format($fmt, $arg)) { MB(FMT("Hi, I'm {0}...", name)); MB(FMT("I am {0} years old!", name.Length)); } /// Output MessageBox.Show(string.Format("Hi, I'm {0}...", name)); MessageBox.Show(string.Format("I am {0} years old!", name.Length));
-
unroll..in
is a kind of compile-timeforeach
loop. It generates several copies of a piece of code, replacing one or more identifiers each time. Unlikereplace
,unroll
can only match simple identifiers on the left side ofin
./// Input void SetInfo(string firstName, string lastName, object data, string phoneNumber) { unroll ((VAR) in (firstName, lastName, data, phoneNumber)) { if (VAR != null) throw new ArgumentNullException(nameof(VAR)); } ... } /// Output void SetInfo(string firstName, string lastName, object data, string phoneNumber) { if (firstName != null) throw new ArgumentNullException(nameof("firstName")); if (lastName != null) throw new ArgumentNullException(nameof("lastName")); if (data != null) throw new ArgumentNullException(nameof("data")); if (phoneNumber != null) throw new ArgumentNullException(nameof("phoneNumber")); ... }
This example also uses the
nameof()
macro to convert each variable name to a string.
These macros can be used to avoid code duplication in a lexer-parser combo. To learn more, read Avoid Tedious Coding with LeMP.
Programming language parsing template
So here’s the formula. First, define a list of punctuation tokens:
// A list of simple tokens to be represented literally
replace (OPERATOR_TOKEN_LIST => (
(">>", Shr), // Note: as a general rule, in your lexer you should list
("<<", Shl), // longer operators first. We will use this token list
("=", Assign), // in the lexer, so longer operators are listed first here.
(">", GT),
("<", LT),
("^", Exp),
("*", Mul),
("/", Div),
("+", Add),
("-", Sub),
(";", Semicolon),
("(", LParen),
(")", RParen)));
Keywords need to be handled separately, as explained later.
Next, define an enum for all the token types:
using TT = TokenType; // Abbreviate TokenType as TT
// Usually you'll need an enum containing the kinds of tokens you'll recognize.
public enum TokenType
{
EOF = 0, // If you use EOF = 0, default(Token) represents EOF
Id,
Num,
unroll ((TEXT, TOKEN_NAME) in OPERATOR_TOKEN_LIST)
{
TOKEN_NAME; // inside 'unroll', must use ';' instead of ',' as separator
},
Unknown
}