A potential new LES

I’ve proposed that WebAssembly adopt LES as the basis for its text format (or at least, that the Wasm text format be constrained such that LES is a superset of it). As part of that proposal, I’ve agreed to modify LES to suit the tastes of CG members. So far, only a couple of people have weighed in; in the meantime, I’ve been thinking preemptively about what changes to LES might make it better liked as a text format.

This document will take a while to read, but is designed to require only a passing familiarity with LES.

But before I talk about a new version, let’s discuss what works well in the current version (LESv2) and then I’ll mention the pain points I’ve noticed when using LES, and potential issues for LES + WebAssembly.

A reminder about notation: syntax trees will be expressed as simple LESv2 without superexpressions. For example:

What works well in LESv2

The basic expression stuff works well and there’s no reason to change it:

Also, as long as the empty statement ‘;’ is allowed, it seems reasonable to keep empty expressions too (e.g. Foo(4,) takes two arguments, the second being the empty identifier), unless perhaps we switch the tuple syntax to use commas (in the interest of concision, I won’t explain the issue here).

Existing pain points in LESv2

Issues for WebAssembly

Ideas to change LES

Dealing with identifiers containing invalid UTF-8

I’m inclined to think this cannot be solved in a nice way, because (at least outside the Wasm world) it is not reasonable, from the standpoint of API usability, to return identifiers as byte arrays on platforms like Java and .NET that use UTF-16 strings. I think we can get a little wiggle room by exploiting orphaned (unpaired) surrogate characters, though, to guarantee round-tripping from arbitrary bytes to UTF-16 and back, without affecting the conversion of normal UTF-8.

A special notation could be offered for invalid UTF-8, e.g. @`\?AA` for the byte string "\xAA" (I think the identifier @`\xAA` should refer to the character 0xAA rather than the byte 0xAA).
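As an illustration of the unpaired-surrogate idea, here is a minimal sketch using Python's built-in "surrogateescape" error handler, which maps each undecodable byte 0xNN to the lone surrogate U+DCNN and back. This is not LES code, and Python strings are not UTF-16, so it only demonstrates the mapping; the function names are hypothetical.

```python
# Sketch only: Python's "surrogateescape" handler maps each invalid byte
# 0xNN to the unpaired surrogate U+DCNN, and reverses that on encoding.
# The function names below are hypothetical, not part of any LES API.

def bytes_to_identifier(raw: bytes) -> str:
    """Decode UTF-8, smuggling invalid bytes through as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def identifier_to_bytes(name: str) -> bytes:
    """Encode back to the exact original bytes."""
    return name.encode("utf-8", errors="surrogateescape")


raw = b"caf\xc3\xa9_\xaa"          # valid UTF-8 "café_" plus a stray byte 0xAA
name = bytes_to_identifier(raw)

assert name[:5] == "café_"         # normal UTF-8 is converted as usual
assert name[5] == "\udcaa"         # the invalid byte became a lone surrogate
assert identifier_to_bytes(name) == raw   # round trip is lossless
```

Valid UTF-8 never decodes to a lone surrogate, which is why the mapping does not disturb ordinary identifiers.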

“I don’t want to write ; after }”

It’s easy to forget that semicolon at the end of if (c) {...};, and I plan the following rule to eliminate the need for it:

If the outer expression is a superexpression with an AfterParticle that is a braced block not followed by a semicolon, then the expression must end at the closing brace, as if a semicolon were present, unless the braced block is followed by a “continuator” identifier. A continuator is either an identifier that starts with @, or one of a small set of words from a predefined list that includes else, catch, finally, and except (and others TBD). For example, if c {...} else {...} would be parsed as a single expression, whereas if c {...} loop {...} would be parsed as two independent expressions.

Note: this rule wouldn’t apply to the superexpression’s initial expression, so for example the closing brace in do {...} while (foo); does not count as the end-of-statement, even though while is not on the list of continuators. Similarly loop {...}; would require a semicolon, but for (x : list) {...} would not.

Wait a minute… it may seem odd that I rejected the idea of a fixed set of keywords but now suggest a fixed set of continuators. The reason is that the set of continuators used by most programming languages is far smaller and more predictable than the set of keywords: in some languages, the only continuators are else, catch, and finally. Also, continuators still aren’t keywords.
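To make the end-of-statement rule concrete, here is a rough Python sketch over a pre-tokenized stream. It makes big simplifying assumptions that are mine, not part of LES: every braced block is the single token "{...}", every parenthesized group is one token, and "the superexpression's initial expression" is modeled as the token right after the leading word. The continuator list holds only the four words named above.

```python
# Sketch only. Assumptions: "{...}" stands for any braced block, and a block
# counts as an AfterParticle only if it is at least the third token of the
# statement (so `loop {...}` and `do {...}` do not trigger the rule).
CONTINUATORS = {"else", "catch", "finally", "except"}  # "others TBD"


def is_continuator(token):
    return token.startswith("@") or token in CONTINUATORS


def split_statements(tokens):
    statements, current = [], []
    for i, tok in enumerate(tokens):
        if tok == ";":
            if current:
                statements.append(current)
            current = []
            continue
        current.append(tok)
        after_particle_block = tok == "{...}" and len(current) > 2
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        continues = nxt == ";" or (nxt is not None and is_continuator(nxt))
        if after_particle_block and not continues:
            # The closing brace ends the statement, as if ';' were present.
            statements.append(current)
            current = []
    if current:
        statements.append(current)
    return statements


one = split_statements(["if", "c", "{...}", "else", "{...}"])
two = split_statements(["if", "c", "{...}", "loop", "{...}", ";"])
assert len(one) == 1 and len(two) == 2
```

With this model, do {...} while (foo); stays a single statement because the block directly follows the leading word.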

“I want no separate ‘(’ and ‘ (’ tokens.”

I have a couple of ideas for satisfying this desire, by replacing the current concept of superexpressions with something more… diversified. My first idea is to introduce three bits of syntactic sugar:

  1. Block-call expression (adds an argument): primary_expr {...} and primary_expr (...) {...}
  2. After a block-call expression, a “continuator” is permitted from a predefined set that includes else catch finally where or any identifier that starts with #. The code starting at the continuator is parsed as a primary expression, and added as an additional argument to the original call.
  3. Top-level expr: an identifier followed by any expression that does not start with ( or an infix operator, e.g. return 0.

Plus, we can eliminate the need for semicolons with a similar rule to that described above.

The first and second rules let us write C-style executable statements, with or without a space after the “keyword”:

if (expr) {exprs;}
if (expr) {exprs;} else {exprs;}
try {...} catch(...) {...} finally {...}
for(...) {...}
switch(...) {...}
do {...} #while (...)

Note the need for # before while, because while is not a continuator.

You could also write things like this:

x = switch (y) { 0 => "zero"; 1 => "one"; };

which is illegal in LESv2.

The third rule covers things like var, new, return, break and import:

var foo = (new Foo()); // parentheses are required around `new` expression
break outerLoop;
import net.loyc.syntax.@*;
return a + b;

But users would have to understand that (unlike in LESv2) return (a + b) * c would have the unintended meaning (return(a + b)) * c.

This plan has a major limitation, as it provides no nice syntax for type declarations and function declarations. Things like this can still be parsed (although semicolons are now required):

fn foo(bar: i32) {...};
fn foo(bar: i32) -> baz {...};
struct Foo {...};
struct Foo : IFoo!T {...};

But their syntax trees become a little weird:

fn(foo(bar: i32, {...}));
fn(foo(bar: i32) -> baz({...}));
struct(Foo({...}));
struct(Foo : ((IFoo!T)({...})));

Because of this, the plan is hard to endorse as-is.

Therefore, I investigated a more elaborate set of ideas (see the next section).

“I want to write x = new Foo() or i32.reinterpret_f32 $N without parentheses.”

I developed a syntax that satisfies this desire along with the previous one, but it’s a relatively complicated proposal so I’ve split it out onto its own page.

“I want operators with letters in them.”

LES already has a mechanism for this: backticks. You can write x`foo`y, which means foo(x, y). But let’s explore alternatives anyway.

As well as signed and unsigned operators in WebAssembly ($x >s $y), operator suffixes could provide an interesting way to create named operators, as in f(x) :where x > 0 (meaning (f(x)) `:where` (x > 0)) - but new users could get confused that f(x) : where x > 0 is completely different ((f(x)) : ((where(x)) > 0)). The other downside is that we’d always need spaces between operators and their arguments. That’s especially a problem for prefix and dotted expressions like -x and foo.bar. So if we really wanted to do this, we would need to compromise by saying that certain operators like . and - can’t have suffixes.

To avoid these problems, we could have “escaping” of letters and words in operators. For example, if we select \ as our escape character, then >\s would be an operator named >s and \where would be an operator named where.

Another possibility is '; this is discussed at the end of the juxtaposition proposal.

Personally, though, I think backquotes are fine. I certainly hope Wasm developers will not get so hung up on a little punctuation as to reject the wider benefits of LES.

Whether we stick with backquotes or not, one remaining issue is the precedence of operators that contain letters. Currently, all backquoted operators have the same precedence, which is immiscible with many other operators (e.g. a `foo` b + c is illegal, because it’s unclear if you meant (a `foo` b) + c or a `foo` (b + c)). Since an operator like >s has a normal operator embedded inside, perhaps the initial punctuation characters should be used to decide the precedence of the operator.
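A small sketch of that last idea, in Python: strip the letters off the end of the operator name and look up the precedence of the punctuation that remains. The precedence numbers and the table itself are made up for illustration and do not reflect the real LES precedence table.

```python
# Sketch only: the numeric precedences here are hypothetical placeholders.
BASE_PRECEDENCE = {"*": 50, "+": 40, ">": 30, "<": 30, "==": 30, ":": 10}


def punctuation_prefix(op):
    """Return the leading run of punctuation in an operator name."""
    i = 0
    while i < len(op) and not (op[i].isalnum() or op[i] == "_"):
        i += 1
    return op[:i]


def precedence_of(op):
    """Precedence from the punctuation prefix; None if there is no known prefix."""
    return BASE_PRECEDENCE.get(punctuation_prefix(op))


assert precedence_of(">s") == precedence_of(">")   # >s behaves like >
assert precedence_of(":where") == precedence_of(":")
assert precedence_of("foo") is None                # purely alphabetic: undecided
```

An operator with no punctuation prefix (such as a plain backquoted word) gets no precedence from this scheme, so it would still need the existing immiscible treatment or some other rule.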

“I don’t like semicolons. Let’s use newlines instead.”

LES could be changed to work that way, and it may have a significant advantage, because we would no longer have to think about accidental mis-parses caused by a forgotten semicolon, and we wouldn’t need any “semicolon insertion” rules. However, this change would also kill JSON compatibility since

{ "foo"
  : ["bar"] }

would suddenly mean something else.

If newline is a terminator, its effect should be nullified after a line of whitespace, or an open brace, or inside parentheses…

{              // newline is ignored here
  Foo(x + y,   // newline is ignored here
      a + b)   // newline is a terminator
}              // newline is a terminator

unless, of course, the user opened braces inside the parentheses:

{               // newline is ignored here
  Foo(x + y,    // newline is ignored here
      {         // newline is ignored here
        a = A() // newline is a terminator
        a + b   // newline is a terminator
      })        // newline is a terminator
}               // newline is a terminator
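The rule illustrated by these two examples can be sketched in Python as a bracket stack: a newline terminates a statement unless the line is blank, ends with an open brace, or the innermost open bracket is a parenthesis. This is only a rough model under my own assumptions; it ignores strings, comments and square brackets, and the function name is hypothetical.

```python
# Sketch only: no handling of string literals, comments, or [ ] brackets.
def classify_newlines(source):
    """Return one bool per line: True if that line's newline is a terminator."""
    stack = []      # currently open '(' and '{' characters
    result = []
    for line in source.splitlines():
        last = ""   # last non-whitespace character on this line
        for ch in line:
            if ch in "({":
                stack.append(ch)
            elif ch in ")}" and stack:
                stack.pop()
            if not ch.isspace():
                last = ch
        inside_parens = bool(stack) and stack[-1] == "("
        ignored = last == "" or last == "{" or inside_parens
        result.append(not ignored)
    return result


example = """{
  Foo(x + y,
      {
        a = A()
        a + b
      })
}"""
assert classify_newlines(example) == [False, False, False, True, True, True, True]
```

Because only the innermost bracket matters, opening braces inside parentheses turns newlines back into terminators, matching the second example.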

You can always add parentheses to any expression, so this rule would suffice, although one could argue that we need a more elaborate rule to cover cases like

x = Foo() +
    Bar() +
    Baz()

Another option is a line continuator, let’s say \, written as

x = Foo()
\ + Bar()
\ + Baz()

or in the more traditional way,

x = Foo() \
  + Bar() \
  + Baz()
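Either placement of the continuator could be handled by a line-joining pass that runs before newlines are treated as terminators. Here is a minimal Python sketch, assuming \ is the continuator and that it never appears at the start or end of a line for any other reason (string literals and comments are not handled); the function name is hypothetical.

```python
# Sketch only: assumes a leading or trailing backslash always means
# "this line and its neighbor form one logical line".
def join_continued_lines(lines):
    out = []
    pending = False                 # did the previous line end with '\'?
    for line in lines:
        body = line.strip()
        leading = body.startswith("\\")
        trailing = body.endswith("\\")
        if leading:
            body = body[1:]
        if trailing:
            body = body[:-1]
        body = body.strip()
        if out and (pending or leading):
            out[-1] = out[-1] + " " + body   # glue onto the previous logical line
        else:
            out.append(body)
        pending = trailing
    return out


leading_style = ["x = Foo()", "\\ + Bar()", "\\ + Baz()"]
trailing_style = ["x = Foo() \\", "  + Bar() \\", "  + Baz()"]
assert join_continued_lines(leading_style) == ["x = Foo() + Bar() + Baz()"]
assert join_continued_lines(trailing_style) == ["x = Foo() + Bar() + Baz()"]
```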

“How will we add new literal types in the future?”

Since literals in Loyc trees can contain anything, you can add new literal types without changing the LES parser by adding a postprocessing stage. For instance, you could support byte literals like bytes("61 62 63 00") by (1) adding a postprocessor that finds the bytes operator and replaces it with a literal containing a byte array, and (2) adding a preprocessor before converting a node to text that replaces all byte literals with calls to bytes.

But this doesn’t entirely solve the problem, because round-tripping is imperfect. For instance, if you construct a one-argument call bytes("AB"), serialize it and deserialize it again (with your postprocessor attached), you’ll get a byte array back rather than the original call.
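The round-tripping problem can be shown with a toy model in Python. The Call and Literal classes below are hypothetical stand-ins for Loyc tree nodes, and the text serialization step is skipped: "printing" is modeled by the preprocessor alone and "parsing" by the postprocessor alone.

```python
# Sketch only: Call/Literal are toy stand-ins for Loyc nodes, not a real API.
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class Call:
    target: str
    args: tuple


def postprocess(node):
    """After parsing: turn bytes("61 62 ...") calls into byte-array literals."""
    if isinstance(node, Call):
        if (node.target == "bytes" and len(node.args) == 1
                and isinstance(node.args[0], Literal)
                and isinstance(node.args[0].value, str)):
            return Literal(bytes.fromhex(node.args[0].value))
        return Call(node.target, tuple(postprocess(a) for a in node.args))
    return node


def preprocess(node):
    """Before printing: turn byte-array literals back into bytes("...") calls."""
    if isinstance(node, Literal) and isinstance(node.value, bytes):
        return Call("bytes", (Literal(node.value.hex(" ")),))
    if isinstance(node, Call):
        return Call(node.target, tuple(preprocess(a) for a in node.args))
    return node


# A byte-array literal survives the round trip...
lit = Literal(b"abc\x00")
assert preprocess(lit) == Call("bytes", (Literal("61 62 63 00"),))
assert postprocess(preprocess(lit)) == lit

# ...but a hand-built call bytes("AB") comes back as a byte array instead.
call = Call("bytes", (Literal("AB"),))
assert postprocess(preprocess(call)) == Literal(b"\xab")
assert postprocess(preprocess(call)) != call
```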

So, we should have a plan for how new literal types can be added, and where possible, old versions of the parser should be able to handle new literals. Bonus points if new literal types can be round-tripped by old code. Here’s my idea about that:

“Can we add a general mechanism for suffixes?”

LES hasn’t really given a meaning to the backslash \ yet, so we could dedicate it to marking suffixes. But what should the precedence of such an operator be? In any case, the backslash itself should probably be included in the name of the operator stored in the Loyc tree, so that %x (equivalent to @%(x)) would not be the same thing as x\% (equivalent to @`\%`(x)).

I’ve always wanted a programming language that supported unit types like “metres”, “px”, “dp” and “MB/sec”, and the most natural way to express this is with a suffix. To that end, the suffix marker \ could be followed not just by punctuation, but by any fancy identifier (that is, letters, numbers and punctuation). The LES parser wouldn’t care whether a suffix like 12.3\metres is to be treated as an “operator” or a “unit”.

Precedence issues with WebAssembly

LES does not permit custom syntax, but you can exploit its built-in syntax creatively. That’s what I did when I proposed the following ways to express certain operators:

function foo($x : i32) : i32 {...}
br exit => result_value                // unconditional branch
br exit => result_value ? condition    // conditional branch
br_table default | [a, b, c] : $index  // branch table (switch) 
f32.store [$addr,0] = 0x0p0            // store into memory

In the current version of LES, the first four are superexpressions, so if they appear within a larger, outer expression, they must appear within parentheses. However, these parentheses would almost never be needed since br and br_table do not return a value to the outer expression, and for a function there cannot be an outer expression.

The first two would parse as intended in all contexts, as long as we eliminate the need for a semicolon after the function’s closing }, as described earlier. The second one has the structure br(exit => result_value). The left-hand side of => has high precedence, which could be a problem if the left-hand side were an arbitrary expression, but the left-hand side is merely a label so nothing can go wrong. The right-hand side of => has the lowest precedence, which is exactly what we want; if you write br exit => $1 = $1 + $2 it has the structure br(exit => ($1 = ($1 + $2))): the entire result expression remains a child of =>, as it should be.

The conditional branch has the structure br(exit => (result_value ? condition)), so as long as the condition and result_value don’t disrupt that structure, all is well. A typical expression like br exit => $z & 255 ? $x == $y preserves that outer structure, since ? is a low-precedence operator. However, suppose you use an assignment, as in one of these:

br exit => $x * $y ? $x = $y // first case
br exit => $x = $y ? $x > $y // second case

An assignment, the only thing with lower precedence than ?, disrupts the structure as follows:

br(exit => (($x * $y) ? $x) = $y)   // first case
br(exit => ($x = ($y ? ($x > $y)))) // second case

I’ve been thinking that the left-hand side of = should have a higher precedence than the right-hand side; by increasing it, the problem in the first case would disappear, but the second case is still borked. Thus if this syntax were adopted, extra parentheses would be required in certain cases, a fact that could confuse people writing Wasm. As a result, it’s probably best to drop the punctuation in favor of either the basic

br_if(exit, result_value, condition);

or possibly this:

br (exit => result_value) if (condition);

br_table has the same problem, but this time it can be solved consistently if the precedence of the left-hand side of assignments is raised.

Finally, f32.store[$addr,0] = -0x0p0 will sometimes need parentheses around it unless the precedence of the left-hand side of = is raised quite high, since for example $x * f32.store[$addr,0] = -0x0p0 is currently parsed as ($x * f32.store[$addr,0]) = -0x0p0. So I think the precedence should be raised quite high (probably to just above *). The only reason not to raise it would be slavish devotion to the precedence rules of existing languages. In practice, existing languages give a semantic error if you write something like x * y = 0, so the potential for changing the meaning of existing code when you paste it into LES is low.
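The effect of raising only the left-hand precedence of = can be demonstrated with a tiny precedence-climbing parser in Python. This is a generic sketch, not the LES parser: the binding-power numbers are invented, operands are single tokens, and "store" and "zero" stand in for f32.store[$addr,0] and -0x0p0.

```python
# Sketch only: each operator has a (left, right) binding power, so the
# left-hand side of '=' can bind more tightly than its right-hand side.
def parse(tokens, binding_powers):
    """Parse atoms and binary operators; return a fully parenthesized string."""
    pos = 0

    def expr(min_bp):
        nonlocal pos
        left = tokens[pos]
        pos += 1
        while pos < len(tokens):
            op = tokens[pos]
            left_bp, right_bp = binding_powers[op]
            if left_bp < min_bp:
                break
            pos += 1
            right = expr(right_bp)
            left = f"({left} {op} {right})"
        return left

    return expr(0)


CONVENTIONAL = {"*": (30, 31), "=": (2, 1)}    # '=' is low on both sides
RAISED_LHS = {"*": (30, 31), "=": (40, 1)}     # left side of '=' just above '*'

tokens = ["$x", "*", "store", "=", "zero"]
assert parse(tokens, CONVENTIONAL) == "(($x * store) = zero)"
assert parse(tokens, RAISED_LHS) == "($x * (store = zero))"
```

The right-hand binding power of = is unchanged in both tables, so a = b * c still groups as a = (b * c) and chained assignments remain right-associative.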

Hold your horses!

We’re not quite done yet: we have to consider the effect of the proposed changes in the separate document. What effect would that have?

End.

See also: Wasm issue
