LESv3 design nearing completion

LESv3 was set back somewhat by the news that WebAssembly’s developers weren’t interested in going beyond s-expressions for the MVP. I wanted to suggest using a subset of LESv3 as the text format, something that could be implemented quickly; for instance instead of using syntax like this:

(func $doubleExp (param $0 f64) (result f64)
    (get_local $0)
    (call $exp)
    (f64.const 2)
    f64.mul
)

They could have used a more compact representation like this - something that would be straightforward to parse and compatible with LES, but lacking syntactic sugar that could be added after MVP (minimum viable product):

.memory 1
.func doubleExp(`0`: f64): f64 {
    `0`
    .call $exp
    2.0f64
    f64.mul
}

LESv3 would have some nice properties even for the MVP, like that fact that it defines clearly how to write identifiers that contain arbitrary UTF-8 characters, arbitrary bytes or both. Understandably, though, they had a lot of other things to worry about and didn’t want to get into the fine bikeshedding details. Hopefully they will pick up the text format issue again after MVP, but it wouldn’t surprise me if they end up using s-expressions forever.

In the meantime I have been reviewing LESv3 - on and off - and made several decisions. The core of LES remains unchanged: it consists of C-like expressions with function calls, infix operators, prefix operators, suffix (++ --) operators, and indexing (e.g. i32[x] = y). It will also have JavaScript-like [lists], {braced blocks}, and tuples. There will be an infinite number of operators, and an operator’s precedence is chosen automatically by a set of rules.

On top of that core, LESv3 will have:

After writing the last post, I made the following decisions:

Unlike LESv2, LESv3 will not be able to parse JSON directly because newlines are significant in LESv3.

Colon operator

A colon can appear at the end of any expression to, in effect, create a label. Note that the colon character is first and foremost a normal operator character so, for example, in x++:, ++: is treated as a unit, which in this case will act as a binary operator. If there is nothing after it, x++: is a syntax error.

Colon is like a suffix operator with a precedence lower than everything else, but it behaves more like a terminator, as only a single colon is allowed (e.g. x: : is not a valid expression).

Normally, a newline is allowed after a binary operator as in

x = Foo(y) +
    Foo(z)

An exception is made for the binary colon operator so that

x = Foo(y) :
    Foo(z)

counts as two separate statements.

Keyword expressions

LES has no keywords for creating functions, classes or other things. Instead of keywords, LESv2 used two different kinds of left parenthesis (one with a space in front, the other without), which help distinguish “superexpressions” (such as a function declaration) from normal expressions. This was a pretty cool idea since it allowed code that resembled Javascript or Rust:

// LESv2 code
if x > 0 { Foo(); } else { Bar(); };

Unfortunately, while using LESv2 I noticed a couple of usability problems. The first and biggest problem was the need to write a semicolon after the closing brace. As a longtime C/C++/C# programmer, I forgot that semicolon on a regular basis (the effect of a forgotton semicolon was usually a syntax error, though occasionally it caused the statement to “merge” with the following statement, usually causing a semantic-level error like “if doesn’t expect so many arguments!”.) This problem disappears automatically when we treat newline as a terminator.

A second problem was that you couldn’t write if !flag or while !done:

if !flag { ... } // syntax error at '{': expected ';'

Since ! is also the generics operator, if !flag means @if<flag> in C# notation and the braces are unexpected. So you’d have to write if (!flag) instead.

A minor third problem was that if-else chains required parentheses after the first else if, even though parentheses are not required after the first if:

if condition {
  Foo();
} else if (x > 0) { // parentheses required
  Bar();
};

This requirement for parentheses helped increase the chance of a syntax error when the user forgot a semicolon after a closing brace.

In LESv3, “keywords” are explicitly marked with a dot. For example, .when is a keyword in this example:

.when condition {
  Foo()
} else .when x > 0 {
  Bar()
}

Of course, as with all LESv3 statements, semicolons are no longer required to mark the end. Parentheses after .when are never needed, either.

Keyword expressions have up to four parts, in this order:

  1. The dot-keyword (.when)
  2. An expression (condition)
  3. A braced block ({ Foo() })
  4. A sequence of “continuators” (else ...), which is a special word like else followed by an expression, a braced block or both.

Parts 2, 3, and 4 are optional, so the core grammar is

KeywordExpression :
  keyword
  // 'greedy' reflects the fact that the grammar is otherwise ambiguous
  greedy( Expr[StartStmt] )?
  greedy( "\n"? BracedBlock )?
  greedy( Continuator )*;

Examples:

Newlines are allowed before braces and continuators:

.when condition
{
  Foo()
}
else .when x > 0
{
  Bar()
}

This is a popular style of code so I thought it was important to allow it, but since braced blocks are valid expressions by themselves, and newlines are normally a statement terminator in LESv3, there is an ambiguity here: is the braced block { Foo() } meant to be associated with .when foo or is it an independent block of code? LESv3 assumes that any braced block after a keyword-expression is meant to be part of the keyword-expression, provided there is no more than one newline before the opening brace. For example, if you write

.if foo
// this line acts like a terminator
{
  Foo()
}

It will be parsed as two statements; instead you should write

.if foo
{ // this line does not act like a terminator
  Foo()
}

Now, if you write

.if hungry
{
  eat
}
watch_tv
{
  etc
}

The line watch_tv is considered independent from the .when expression, so this is in fact three separate “statements”: first .if hungry { eat }, second watch_tv, third { etc }. In order to add additional clauses like an else clause, LESv3 introduces a concept of “continuator keywords”: special words like else that allow a keyword expresson to continue.

Should continuator keywords be bona fide keywords, or (in accordance with LES tradition) contextual keywords: identifiers that become keyword if they appear in certain locations? I am undecided.

The set of continuators will be fixed when the definition of LESv3 is finalized; the current set is the following words:

else elsif catch except finally
while until plus minus so
and or but 

A continuator consists of a continuator keyword, followed by an expression and/or braced block:

Continuator :
  "\n"? ContinuatorKeyword
  "\n"?
  ( BracedBlock
  / TopExpr "\n"? // top-level expression
    BracedBlock?
  )

Dealing with the ambiguity

As mentioned, if a keyword expression like .keyword expression is followed by a newline and a braced block, it’s unclear whether the braced block was meant to be associated with the keyword.

Of course, an LES printer is forced to deal with this issue. It cannot print a keyword expression on one line (in the usual way) and a braced block on the next line, lest the two expressions be merged upon reparsing. The conservative solution to this problem is to add a semicolon to every expression within a braced block.

A more careful solution is to add a semicolon on expressions that contain (or end with) a keyword expression (e.g. x = .foo x { };) if they are followed by an expression that begins with a braced block (e.g. {x} = {5} or {}()). It’s best for a printer not to try to be more clever than this, because you can’t necessarily avoid a semicolon when the keyword expression ends with a closing brace. That’s because a braced block is an expression in its own right and can count as one of the “expression” parts of a keyword expression instead of as the “braced block” part. For example, .if {x} {y} else {z} ends with a closing brace, but could be followed by another braced block because {z} counts as the “expression” part of the else clause, not as the “braced block part”.

We don’t want end-users to think about this, of course. You can design an LES-based language where this is not an issue; for example, if a language is designed so that a braced block cannot appear at the beginning of an expression, the issue does not arise. And let’s consider what happens if we built a C-like language on top of LES. Only six statements in C begin with a specific keyword and do not have an associated braced block:

break
continue
goto
return
case
typedef

All of these except case and typedef alter control flow, so if there is a braced block after them, the braced block is normally unreachable and didn’t really need to exist. Meanwhile, case could be expressed in LES without using .case, and .typedef would appear in a non-executable context where a braced block is meaningless. So in practice, the ambiguity is not a problem here.

Typically, languages will often have some reason why a braced block at the beginning of a statement (and after a keyword-expression) could be valid code, but no good reason for users to actually write that code. As another example, if you use braced blocks to define dictionaries then

{"print": x => stdout.print(x)}["print"](123)

might be a valid statement that prints 123, yet there is no real-world reason to write a statement of that form.

Adding keywords to LES

LES itself won’t have a mechanism to define traditional keywords (with no “.” in front of them), but it should be straightforward to add them programmatically. For instance, the .NET Les3LanguageService lets you perform lexing and parsing as two separate steps, which lets you insert a lexer filter in between the lexer and the parser. Within this filter you could find identifiers that you want to treat as keywords, and replace those identifiers with keyword tokens (changing class to .class, while leaving `class` alone.) Perhaps in the future I’ll add a filter class to the Loyc.Syntax library so you can do this more easily.

However, I’m considering whether to pick a set of predefined keywords so that typical languages built on LES won’t want to resort to such tricks.

Word Operators

An identifier that starts with a lowercase letter can be used as a right-associative binary operator with very low precedence. This allows binary operators that look as if they were “built into” LES:

Scream("Negative inputs are not allowed!") if input < 0
.class Foo inherits FooBase implements IBar { /*...*/ }

In the syntax tree produced by LESv3, a single quote (') will appear before the operator name. For example, paper beats rock has the syntax tree `'beats`(paper, rock). Thus, operators are easily distinguished from normal function calls.

There is nothing ambiguous about allowing ordinary words as operators, although it is potentially confusing to users if it is not obvious which words are operators and which are normal identifiers, so an LES-based language should use such words judiciously (e.g. by only allowing words from a predefined list to be used as operators).

If we’re not careful, there is a potential for people to write code that is “obviously” wrong but still parses. For this reason there are a couple of special rules about how word-operators are used that do not apply to normal binary operators:

  1. Unlike normal operators, a word operator cannot be followed by a newline, except inside parentheses or square brackets.
  2. Word operators must be followed by a space or tab character.

The first rule ensures that an error in one expression doesn’t cause it to inadvertently “merge” with the line afterward. The second rule catches some mistakes like these:

Leave() if if MovieIsBoring() // repeated operator
food = fridge.Foo Food(x, y)  // two identifiers where one was expected
x = x y*z                     // uhh... what?

I am also considering a third rule:

  1. Word operators cannot use uppercase letters, or mixed case.

Mixed-case identifiers are very common in some languages, notably C# and VB where virtually all public APIs have both uppercase and lowercase letters. Disallowing them as operators would form a third line of defense against the parser accepting code that contains a mistake.

I was thinking of defining two kinds of word operator: UPPERCASE and lowercase. Lowercase operators would have very low precedence, while uppercase operators would have somewhat high precedence. However, for the sake of simplicity perhaps there should be only one precedence in LESv3 and disallow uppercase letters. If LES becomes successful, uppercase operators could be added in LESv4.

The precedence of lowercase word operators produces the following equivalences:

// Right associative
a + b op_one c + d op_two e + f === (a + b) op_one ((c + d) op_two (e + f)) 
// Ultra-low precedence
a = b && c operator x = y || z  === (a = (b && c)) operator (x = (y || z))
// The only operator with lower precedence is `=>`, and only on its right
fn = x => x operator 123        === fn = (x => (x operator 123))
// A word operator can be embedded within a keyword expression
.keyword x operator 123 { }     === .keyword (x operator 123) { }
// When a keyword expression ends, it can be followed by a word operator
.keyword { } operator 123       === (.keyword { }) operator 123

Their extremely low precedence makes word operators a natural companion of keyword expressions. Keyword expressions like .eat x { } cannot accept multiple arguments before the braces, e.g. .foo x, y { } is invalid. Word operators solve this problem by taking the place of the comma, as long as you can think of a name that fits the situation, e.g. .watch "Iron man" at 8pm. Whatever code handles the .watch expression would then check that 'at was called with two arguments.

An identifier cannot be used as an operator, however, if it is one of the “continuators” (else, catch, finally, etc.). This rule eliminates the ambiguity in .foo x {} else y: clearly else is a continuator and not a word operator.

Combo Operators

Dan Gohman proposed sign-aware infix operators for WebAssembly, as in

$x = $y >s $z

Background: WebAssembly data types are unaware of signedness. For example, i32 (32-bit integer) is neither signed nor unsigned, and so the operator itself must indicate whether a signed or unsigned operation should be performed: hence >s for “signed greater than”. This version of signed operators can’t work in LES because it is ambiguous: in general, there’s no way to tell that s is part of the operator and not a separate identifier.

After realizing that word operators were a workable concept, I was delighted to discover we could combine a word with a normal operator like this:

$x = $y s> $z

This is not ambiguous. s appears right after an expression so it cannot be interpreted as a normal identifier. It cannot be a word operator either, because word operators must be followed by a space or tab and s is not. Instead, the identifier s and the punctuation > are merged by the parser into a single unit called a combo operator. The punctuation portion of the operator determines its precedence; thus s> has the same precedence as >.

The word and punctuation have unlimited length, and the operator must be followed by a space, tab, or newline character to clearly mark where it ends. In the final syntax tree, the operator gets a single quote in front of it, so $y s> $z means `'s>`($y, $y).

The disambiguation marker

An LESv3 parser should create a “trivia attribute” to record the fact that an expression is in parentheses. For example, (x*y)+z has parentheses trivia while x*y+z does not. However, the trivia should not be recorded if the parentheses are just used to associate attributes with a subexpression, as in @addition (@multiplication x * y) + z.

Occasionally a printer is asked to print a problematic expression like `'*`(x + y, z) and would like to use operator notation to do so, even though the tree doesn’t include parentheses trivia. For this purpose, LESv3 defines the disambiguation marker @@. @@ acts like an “empty attribute”: it is allowed where attributes are allowed, but it does not create an attribute. This allows a printer to more easily represent the presence or absence of parentheses in the Loyc tree. So (@@ x*y)+z and x*y+z will have identical Loyc trees, and more importantly (@@ x + y) * z has the same tree as `'*`(x + y, z).

Printers should add a space after @@, since @@x counts as a custom literal, not as the identifier x.

New idea: Token Lists (pseudo-s-expressions)

My implementation of LESv2 had “token tree” literals, which allow custom syntax to be embedded in LES code. I used them to support my parser generator. For instance you can write a grammar like this in LESv2:

// Invoke parser with token literal @{...}
LLLPG lexer @{
  EmailAddress :
    UsernameChars ('.' UsernameChars)*
      '@' DomainCharSeq ('.' DomainCharSeq)*;
  UsernameChars :
    ('a'..'z'|'A'..'Z'|'0'..'9'|'!'|'#'|'$'|'%'|'&'|'\''|
     '*'|'+'|'/'|'='|'?'|'^'|'_'|'`'|'{'|'|'|'}'|'~'|'-')+;
  DomainCharSeq :
    ('a'..'z'|'A'..'Z'|'0'..'9')
    ( '-'? ('a'..'z'|'A'..'Z'|'0'..'9') )*;
};

What mechanism should LESv3 use for custom syntax? We already have triple-quoted custom literals, so we already have enough for an MVP (minimum viable product) since this grammar could be written as

.LLLPG lexer'''
  EmailAddress :
    UsernameChars ('.' UsernameChars)*
      '@' DomainCharSeq ('.' DomainCharSeq)*;
  UsernameChars :
    ('a'..'z'|'A'..'Z'|'0'..'9'|'!'|'#'|'$'|'%'|'&'|'\''|
     '*'|'+'|'/'|'='|'?'|'^'|'_'|'`'|'{'|'|'|'}'|'~'|'-')+;
  DomainCharSeq :
    ('a'..'z'|'A'..'Z'|'0'..'9')
    ( '-'? ('a'..'z'|'A'..'Z'|'0'..'9') )*;
''';

But this is less convenient; there will be no syntax highlighting, for example.

The main disadvantage of token literals is that they are not Loyc trees; they are a separate data structure that would have to be defined in detail if LES were a widely-used standard. It would be simpler and easier, I think, to store custom syntax as a Loyc tree. Therefore, I propose a mapping from token lists to Loyc trees.

I propose the unary single quote operator (') to introduce a token list. The single quote by itself would creates a call to `'`, with each token afterward treated as an arguments. For example, this code:

x = ' <div style="...">123</div>;

would be equivalent to

x = `'`( `'<`, div, style, `'=`, "...", `'>`, 123, `'</`, div, `'>` )

To understand this, remember that LES lets you write special identifiers in backquotes and operators in LES are already stored with an initial single quote, e.g. 2 + 2 really means `'+`(2, 2).

Note: I used HTML code as an example because it is easily recognized, but you should not try to represent HTML this way since the syntax is too different. LES, for example, uses C-style escapes like "\"" while HTML does not. Instead, use a custom string like html'''<p></p>''' and parse it with a dedicated HTML parser. While token lists could be used to approximate another language like C or HTML, they are mainly intended for DSLs (domain-specific languages) designed to be LES-compatible.

The single quote can be followed immediately by a sequence of characters in the same character set as @@-literals. This becomes part of the name of the operator, e.g.

x = 'make my breakfast
// equivalent to
x = `'make(my, breakfast)
// equivalent to
x = my make breakfast

If the token list contains parentheses, it creates a sublist with, perhaps, a target of '(). Example:

x = '(* (a b c) (Foo))
// equivalent to
x = `'`(`'*`, `'()`(a, b, c), `'()`(Foo))

These token lists work significantly differently from s-expressions, mainly because spaces are not generally required between arguments: Foo* is two arguments, not one. Still, you could write very Lisp-like code this way.

One cannot directly write a Lisp-style prefix expression like + x (* y z) that would mean x + (y * z) in LES, but this can be made possible with a simple post-processing step such as a LeMP macro.

By the way, it’s possible to examine the output from LES to find out if two tokens were adjacent - a fact which postprocessing steps might exploit for their own reasons. For example, in ' Foo*, the * Range.EndIndex of the identifier Foo equals the Range.StartIndex of the identifier '*.

If the token list contains square brackets or braces, the contents of those brackets or braces are parsed as normal expressions. This provides a predictable way for “normal” code to be embedded within custom syntax. For example, my parser generator could use square brackets to represent arguments to rules, and braces to represent grammar actions and semantic predicates:

.LLLPG parser('
  // Scans up to 10 digits of an integer
  UpTo10Digits ::= Digits[10];
  Digits[maxLen: int] ::=
      {len := 0}
      (&{len < maxLen} '0'..'9' {len++})+;
)

The stuff like maxLen: int and len++ would be parsed as normal expressions, while everything else would be parsed as token lists. In turn, those token lists would be interpreted by a .LLLPG macro that decides what they mean.

Additional details:

Closing thoughts

Undecided issues

What keywords?

Here’s what a predefined keyword set might look like. We’d start with Javascript keywords that are not identifier-like (int, long, this), attribute-like (public, volatile) or operators (typeof, instanceof):

// Simple statements (7)
goto  return  break  continue  
throw  case  var

// Block statements (8)
function  class
if  while  do  for  switch  try 

// Expressions (2)
new  delete

// "Future" reserved words (5)
interface  enum  import  export  package

Why focus on JavaScript? JavaScript is the only language similar enough to LES that partial compatibility is possible. Other languages like C, C#, C++ and Java have too many incompatible elements, e.g. typed variable declarations (Foo x), angle-bracket generics (Foo<T>), and paren-typecases ((Foo) x).

JavaScript also has some elements that LES wouldn’t handle; the biggest problem is block statements without braces, like if (c) f(). Also unsupported: typeof e and void 0 (you’d have to change these to typeof(e) and void(0)). The JavaScript expression x instanceof y could still work, although its precedence is wrong. Similarly, JavaScript’s yield* could work, but the precedence of * is high so in some cases the expression after yield* would need parentheses around it.

Note that LES would treat new significantly differently from Javascript (because LES keyword expressions act like they have very low precedence, unlike JS new), but typical JavaScript code like var x = new Foo(y) would parse successfully.

We would then add some additional keywords, such as:

// Type & namespace definitions (4)
struct  type  alias  namespace

// Other (8)
fn  prop  process  construct  
data  loop  yield  expect

These cover some common concepts from various programming languages:

Including literals and continuator keywords, LES would have around 48 keywords: more than C (32), but fewer than C++ (90), C# (77), Javascript (59) or Java (52).

How complex is LESv3?

Compared to JSON or s-expressions, LES is quite complicated. Also, LESv3 is a bit more complex than LESv2.

However, even if we add lots of new keywords, parsing LES is still be easier than parsing all popular languages in the C family. For example, compared with C, LES avoids the complexity of feedback from the symbol table to the parser/lexer, the complexity of the preprocessor, and the subtleties of C type declarations like char *(*(* const *foo[][8])())[] and “abstract declarators” like int (*(*)())(). In terms of code size, a C parser will be moderately larger before adding the mandatory preprocessor. For reference, the main LESv3 parser grammar (excluding the lexer and some helper code) is currently 333 lines, which the parser generator expands to 810 lines of C#. The C parser pycparser, written in Python, is 1740 lines (excluding the lexer, preprocessor, and other helper code, though this does include an unusually large amount of documentation).

LESv3 is at least 4 times simpler than C# or Enhanced C#:

Fun fact: my LES implementation does something you won’t get from most other parsers: it preserves comments in the Loyc tree as so-called “trivia attributes”. This feature is not implemented by the parser, though. It is a language-independent algorithm used by LESv2, LESv3 and Enhanced C#.