WebAssembly text format: LESv3

Background: WebAssembly is a new technology being standardized for running native applications (e.g. C/C++/Rust) on the web. It used to be an expression-based language that was typically viewed as an abstract syntax tree (AST) — basically a high-level assembly language, moreso than C ever was. Recently, however, the design shifted into a postorder “stack machine” which supports all the code that could have been expressed as an AST, plus some “oddball” code that is not expressible as an AST (details).

The text format of WebAssembly is not yet standardized; this article is the second followup to my proposal to use LES, an open-ended multi-purpose code format. The current version of LES is simpler and JSON-compatible, while some ideas here add complexity and are tailored more specifically for WebAssembly.

Right now I’m gathering opinions. How complex should LES be? How valuable is backward compatibility with JSON? How acceptable is space-sensitivity? This article is long, but at the end there will be a survey so you can have your say in the WebAssembly text format.

Now that Wasm is moving toward a stack machine, I think that LES is now more relevant than ever, because there seems to be a need for not one Wasm language but two: one for Wasm itself, and another for the pseudo-wasm AST used by producers (i.e. binaryen). Both languages would support expressions, but each would interpret them differently. In the “official” Wasm, $a() + $b() could be syntactic sugar for the sequence $a(); $b(); i32.add if both functions return i32, whereas in compiler tools like binaryen, the same text would represent the syntax tree i32.add($a(), $b()).

While only one Wasm variation is expected so far, it’s not hard to imagine others:

A general-purpose syntax like LESv3 allows people to do all this without touching the parser or the printer (serializer), let alone writing new ones.

This post summarizes my current ideas for LESv3 syntax, especially as they relate to WebAssembly, while highlighting a few questions for the community.

What is LES? A one-paragraph summary

If LISP had been invented in the 90s as a native member of the C family, LES would be its parser. Like the s-expression, LES is parsed into a simple data structure called a Loyc tree, but unlike s-expressions, the data structure is designed to hold code. Instead of using a “list” as the recursive element, Loyc trees use a “call”: f(x, y) instead of (f x y). f is called the “target” of the call and although most targets are identifiers, a target can be any node, including another call (e.g. f(x)(y)). Also, every node has an (optional) list of attributes attached (which can include “trivia” such as comments, and in Wasm might be used for debug info), and should also hold a source code range for use in error messages (e.g. ~/code/my_code.les:1012:12:1012:19).

In short, LES is not a programming language, it’s a data format that represents code in a natural way.

Like LESv2, LESv3 will support general C-like expressions with function calls, infix, prefix, and suffix (++ --) operators and indexing (possible syntax for stores: i32[$x] = $y). It will have JS-like [lists], maybe {dictionaries}, and tuples. Operators are represented by calls to identifiers that start with a single quote, e.g. 2 + 3 is a call with a target called '+. In the current notation, 2 + 3 could also be written as `'+`(2, 3).)

Relationship to WebAssembly

I’ve been optimizing LESv3 to make it a good basis for the WebAssembly text format and I am publishing this in the hope of getting your feedback, opinions and preferences. I’d much rather have your opinion now than after I’ve written separate parsers for multiple languages!

Keep in mind that LES is also designed for general-purpose non-Wasm uses. That’s the main reason for syntax differences between LES and the older proposal by Dan Gohman (and Michael Bebenita?), now developed in this repo. For example, in that proposal, plain words like foo are reserved for future use as keywords, whereas $foo is an “identifier”. In LESv3 (and in this document!), plain words like foo are considered identifiers, while $foo is just an identifier with the prefix operator $ in front.

Wasm stack machine in LESv3, by example

Let’s consider the C example

int sumIntegers(int* input, int length) {
  int sum = 0;
  for (int i = 0; i < length; ++i) {
    sum += input[i];
  }
  return sum;
}

The optimized s-expression I got from Wasm Explorer is, after adjusting for readability,

(module
  (memory 1)
  (export "memory" memory)
  (export "_sumIntegers" $_sumIntegers)
  (func $_sumIntegers (param $input i32) (param $length i32) (result i32)
    (local $sum i32)
    (set_local $sum (i32.const 0))
    (block $stop
      (br_if $stop (i32.lt_s (get_local $length) (i32.const 1)))
      (set_local $sum (i32.const 0))
      (loop $unused $loop
        (set_local $sum 
          (i32.add (i32.load (get_local $input)) (get_local $sum)))
        (set_local $input (i32.add (get_local $0) (i32.const 4)))
        (br_if $loop (set_local $length 
          (i32.add (get_local $length) (i32.const -1))))
      )
    )
    (return (get_local $sum))
  )
)

Manually updating this to stack-machine form, I get

  (memory 1)
  ...
  (func $_sumIntegers (param $input i32) (param $length i32) (result i32)
    (local $sum i32)
    (i32.const 0) (set_local $sum)
    (block $stop 
      (get_local $length) (i32.const 1) i32.lt_s (br_if $stop)
      (i32.const 0) (set_local $sum)
      (loop $loop
        (get_local $input) i32.load (get_local $sum) i32.add (set_local $sum)
        (get_local $input) (i32.const 4) i32.add (set_local $input)
        (get_local $length) (i32.const -1) i32.add (tee_local $length) (br_if $loop)
      )
    )
    (return (get_local $sum))
  )

In LESv3, this stack-machine code could be expressed direcly as follows (omitting get_local and i32.const, which are redundant):

  .memory 1;
  ...
  .fn $_sumIntegers($input: i32, $length: i32): i32 {
    $sum: i32;
    0; set_local $sum;
    {
      $length; 1; i32'lt_s; br_if stop;
      0; set_local $sum;
      loop (loop) {
        $input; i32'load; $sum; i32'add; set_local $sum;
        $input; 4; i32'add; set_local $input;
        $length; -1; i32'add; tee_local $length; br_if loop;
      }
      stop:
    }
    $sum; // return value
  }

More typically it would use expression notation, like this:

  .memory 1;
  ...
  .fn $_sumIntegers($input: i32, $length: i32): i32 {
    $sum: i32;
    $sum = 0;
    {
      br stop 'if $input '<s 1;
      $sum = 0;
      loop (loop) {
        // I picked := for set_local (which discards the result) 
        // and = for tee_local (which puts the result on the stack);
        // feel free to vote your own preference.
        $sum := i32[$input] + $sum;
        $input := $input + 4;
        br loop 'if $length = $length + -1;
      }
      stop:
    }
    $sum; // return value
  }

The Wasm assembler can allow the two notations to be freely mixed (similar to how the spec tests do today), e.g.

$length = $length + -1; // $length is left on evaluation stack
br_if loop; // branch to loop if $length is nonzero

LESv3 syntax elements

Now that you know what it looks like, let’s discuss the details.

Semicolons

Should semicolons should be required to terminate statements? If semicolons are required then a LESv3 parser could read JSON files; if statements are terminated by newlines then JSON isn’t supported because

{ "foo"
  : "bar" }

will be parsed like { "foo"; (:"bar"); } which is quite different. Maybe a postprocessing step could restore the intended meaning, but instead you should probably just use a dedicated JSON parser in the first place.

For various technical reasons, it seems easier if newline is a terminator, so I am inclined to drop JSON support. If newline is a terminator, semicolon can still be used as a separator or terminator, which is useful when writing postorder notation or declaring multiple variables.

Update: They’re trying to standardize Wasm quickly, so the browser people want to represent Wasm with a flat instruction list. I’d say a flat list looks better without semicolons, e.g.

$a
$b
i32'add

Keywords

LESv2 has no keywords. Although Wasm doesn’t need this, in LESv3 I decided to introduce exactly three keywords, true, false and null, because I found that writing a special notation like @true was cumbersome for end-users, while relying on a postprocessing step to translate true into a true literal made it difficult or impossible to make an identifier that happened to be named true. In LESv3, true is a literal and `true` is an identifier named true.

Fancy identifiers

An identifier is a name (e.g. foo). Normal identifiers have letters, digits, underscores (_) and/or apostrophes ('). The first character must be a letter or _.

In LES it has always been possible to use any arbitrary string as an identifier. But WebAssembly has an even more stringent requirement: allowing any arbitrary byte string as an identifier. These byte strings will be interpreted as UTF-8, but some of them are not character strings.

Dan Gohman’s prototype used backslashes to escape individual characters in an identifier. That prototype doesn’t support standard escapes like $line1\nline2 for an identifier with a newline in the middle, whereas I thought identifiers should be escaped similarly to normal strings (except that you could write, for instance, fun\ fact\! to get an identifier with a space and exclamation mark in it).

In LESv3, my plan had a problem. If true is a keyword then we need a way to write an identifier called true. Using \true to represent the identifier called true is a no-go because \t conventionally represents a tab character, so this particular identifier wouldn’t have its obvious meaning of “tab character followed by rue”.

Instead, I decided that rather than escaping each individual character, identifiers can be enclosed in backquotes and parsed exactly as a string, e.g. `I'm a whole sentence!` is an identifier. Compared to the alternative, this rule gives longer identifiers in some cases and shorter ones in others. It’s perfect for representing C++ mangled names like `?Fmyclass_v@@YAXVmyclass@@@Z`, but less efficient when there’s single strange byte like `\x1B`.

So in LESv3, all strings will be interpreted as UTF-8, and \x lets you write an arbitrary byte. This means you can write invalid UTF8 like \xFF, but also valid UTF-8 such as \xE2\x82\xAC which means the same thing as \u20AC. One subtle problem with this plan is that some languages (JavaScript, Java, C#) store strings as UTF-16, and I do not wish to define identifiers as byte strings rather than strings. I solved this problem with a scheme that encodes invalid UTF-8 as invalid UTF-16 in a way that is guaranteed to round-trip properly.

Keyword expressions

LES has no keywords for creating functions, classes or other things. In lieu of keywords, LESv2 used two different kinds of left parenthesis (one with a space in front, the other without), which help distinguish “superexpressions” (such as a function declaration) from normal expressions.

// LESv2
if x { f(); };   // Superexpression equivalent to `if(x, {f();});`
if (x) { f(); }; // Superexpression equivalent to `if((x), {f();});`
if(x, y);        // Call a function named `if`
if (x, y);       // Syntax error with suggestion to remove the space
fn Foo(x: i32);  // Superexpression equivalent to `fn(Foo(x: i32));`
fn Foo (x: i32); // Syntax error with suggestion to remove the space

Although no one from the Wasm group objected to this, no one supported it either. I decided to try a different design for v3 that avoids any possible confusion. At first I planned to use # to denote “keywords”:

#fn Foo(x: int32) {...}

But # is a relatively noisy character to look at, and you need the shift key to make one. My new plan is to use a dot, which I think is a little easier on the eyes and hands:

.fn Foo(x: int32) {...}

This makes some sense, as . is used for “directives” in some assembly languages. Following the dot, a “keyword statement”, as this is called, has a single expression followed by an optional braced block with optional “continuator clauses”. Here are some examples of keyword statements, a.k.a. “dot expressions”:

// Dot expression with expression
.return $x + 1;
// Dot expression with expression and braced block
.struct Point { x: f64; y: f64; };
// Dot expression with expression, braced block and continuator
.if x > 0 { f(); } else { g(); };

A “continuator” is an identifier from a predefined set that includes else, catch and finally, that allows an extra clause to be added. The exact set of words allowed as continuators is not finalized, and the syntax of a continuator clause is not finalized either.

The dot in the keyword-expression becomes part of the name of the identifier that is called, e.g. .return $x + 1 really means `.return`($x + 1).

Potentially, a keyword statement could take comma-separated arguments:

.memory initial=1, maximum=10, exported=true;

However, LESv3 is a flexible language, so a keyword-expression is allowed anywhere that a normal expression is allowed, including as a function argument, which makes this ambiguous:

foo(bar, .baz a, b, c);

Does this function take four arguments, or two? The same issue arises if we try to keep JSON compatibility, since you could write something like

{"key":"value", .baz a, "b":"c"}

Block calls

LESv3 borrows a feature from Enhanced C# and takes it one step further. In Enhanced C#, you can write a “block call” which is a function call with a pair of braces afterward:

unless(IsKitchenClean) { CleanKitchen(); }

This allows users to add new constructs that look built-in but are not. In both Enhanced C# and LES, the resulting syntax tree is a normal call with the braced block added as an extra parameter:

unless(IsKitchenClean, { CleanKitchen(); }); // equivalent

(The braced block itself, if you were wondering, is a call whose target is `'{}`.)

A block call can appear in the middle of an expression:

    str = "<" + switch(x) { A => "Eh?"; B => "Bee"; ... } + ">";

In Wasm, this feature could be used for loops:

   loop (label) { infinite(); br label; }

LESv3 adds a new feature: the braced block can be followed by a continuator clause. These clauses will have the same syntax as the continuator clauses on the keyword-expressions you saw earlier:

    x = if (c) { a; } else { b; };

If we keep both “keyword expressions” and “block calls” in LESv3, you can choose between two styles:

    // Block call (C style): parens and braces required
    if (c) { a; } else { b; };
    // Keyword expression (Rust style): only braces required
    .if c { a; } else { b; };

For Wasm, I’ve chosen to prefer the first syntax style for the contents of function bodies, and the second style outside functions. Note that the two forms have different call targets (the first one calls if while the second calls `.if`.)

Currently, keyword-expressions cannot start mid-expression, i.e. you can’t write x = .foo y {...} and must write x = (.foo y {...}) instead, but this restriction could be lifted.

Just so we’re clear, continuators are not keywords, they are just words that you wouldn’t expect to see after } except for the purpose of continuing the previous statement. The exact syntax and output tree for a continuator clause has not been finalized; here’s one possibility:

    try { a; } catch (e: Exception) { b; };
    // ..could be equivalent to..
    try({ a; }, `'catch`(e: Exception, { b; }));

Juxtaposition

I’m considering whether to support a third special syntax in LESv3: a unary juxtaposition operator, which allows any unadorned identifier (i.e. no $, no .) to act like a (high-precedence) unary operator:

// These two lines are equivalent
$x = log sqrt $y * $z;
$x = log(sqrt($y)) * $z;

Juxtaposition is a handy feature, but it’s a fallback. It’s not compatible with everything else and if another interpretation applies, the other interpretation will be used instead. The identifier that you want to use as an operator can only be followed by an identifier or a nonnegative literal. For example:

foo 12     // Juxtaposition:      foo(12)
foo $x     // Juxtaposition:      foo($x)
foo -x     // Normal subtraction: (foo) - x
foo x.y    // Juxtaposition:      foo(x.y)
foo (x).y  // Normal call:        (foo(x)).y
foo $bar() // Juxtaposition:      foo($bar())

In the postorder code above, you may have noticed that the names of some operators have changed: i32.lt_s is now i32'lt_s and i32.add is now i32'add. That’s because LES, like most other programming languages, defines dot (.) as an operator and not as part of an identifier. In most contexts the dot causes no trouble, but it’s not compatible with juxtaposition notation since the left-hand side must be an identifier. The single-quote, on the other hand, is permitted in identifiers (it’s treated the same way as a digit.) Alternatively we could use underscores: i32_add.

Technically it’s possible to parse code like this, dots and all:

i32.eqz i32.clz $x

But I don’t think we should, because it either increases the complexity of the LES parser’s grammar from LL(2) to LL(*), or requires challenging tricks in grammar actions. In fact, my parser already relies on a couple of tricks and I want to avoid adding more (one trick efficiently supports an infinite number of operators with over 20 precedence levels. Another is a flag that controls whether block-calls with {braces} are allowed; block calls are not allowed inside keyword-expressions.)

Other options: we could avoid using the juxtaposition feature, or drop it from the language entirely.

What about opcodes like i32.trunc_s/f64 and i32.reinterpret/f32? Naturally, slash is an operator, so its name must change. I suggest i32'trunc_s_f64 and i32'reinterpret_f32.

Labels

Remember that : is a binary operator, as in $x: i32. How is it possible to also use it for labels? The answer is not simple. In fact, I originally planned to use labels with a colon prefix, as in :label, but if newlines do not act as end-of-statement markers, then this plan doesn’t really work as I will explain later. So I gave it some thought and found a combination of novel tricks that will allow the familiar label: syntax.

First, I realized that if : has a higher precedence on the left side than the right side, it be used for labels and produce a sane syntax tree. This is basically the same trick used in C# so that

f = a => b = a; // Lambda

parses as

f = (a => (b = a));

Similarly, we can contrive precedence rules so that

label1: 
  br loop 'if $length = $length + -1;
label2:
  x = c ? a : b;

parses like

label1: ((br loop) 'if ($length = ($length + -1)));
label2: (x = (c ? (a : b)));

The colon is still treated as a binary operator, which leads us to the second trick: using a postprocessing step to separate the label from the expression that follows it. This shouldn’t be a big deal: since labels don’t really exist in Wasm anyway (only blocks), a postprocessing step is already needed.

The final problem is that label: is currently illegal if there is no expression afterward. To solve this, we can define a special rule that a colon can be used as suffix operator, but only if there is no expression on the right-hand side.

Frankly, this is a lot of trouble to get the familiar syntax. If we drop JSON and use newlines as terminators, I’m inclined to drop the tricks and use : as a prefix operator (which the parser already allows): :label. This isnt’ very workable if newline is not a terminator, though. For starters, it’s fairly ugly to have two punctuation marks:

:label;

But there’s a deeper problem because : is also an infix operator. If the input is

$x = 1
:label;

Then unless we treat the newline specially, this will get parsed in a silly way:

$x = (1 : label);

Note that labels, unlike local variables, don’t need a $ to mark them as labels, since they appear in restricted contexts and there’s no harm in defining a label named after an opcode (e.g. grow_memory:).

Update: I should point out that if we use newlines as terminators, only one trick is required: treating colon-followed-by-newline as a suffix that ends an expression.

An (uglier) alternative is to avoid the colon entirely. For example, if # is defined as an ordinary identifier character (like a letter of the alphabet) but is reserved for use by labels, the following code is parsed as two separate statements as we desire:

br #stop 'if { $input '<s 1 }
#label;

Custom/alphanumeric operators

LES has always had an “infinite” number of operators, since any combination of operator characters ([!%^*+=|<>/?:.$&~-]) can be an operator, whose precedence is chosen by baked-in rules.

Outside Wasm, you would expect r>s to be parsed as r > s. But Wasm integers are not signed or unsigned, so Wasm has an unusually good reason to put letters into operators to indicate their signedness. To resolve this tension between Wasm and “normal” languages, LES does allow letters in operators, but only if the operator is specially marked.

In the new syntax, an operator can contain both operator characters and identifier characters ([A-Za-Z0-9_], excluding the single quote) if it begins with a single quote. Thus $f() '>s g() is a signed comparison.

Single quotes can either start a character literal like 'A' or an operator like 'A; the difference between them is the lack of a closing quote (three characters of lookahead are sufficient to distinguish between them.)

Alphanumeric operators also allow you to construct sentence-like expressions, like this:

    br stop 'if $input '<s 1;

I recently changed the precedence rules so that an operator like <=s has the precedence you would expect.

“Word” operators like 'if, have two possible precedences depending on whether they start with a lowercase letter. If the operator starts with a lowercase letter, it has very low precedence and is right-associative. This example would be parsed like so:

    (br stop) 'if ($input '<s 1);

(Upon seeing the 'if operator, the assembler would look for the br call in its first child node to figure out that it’s a br_if operation.)

The above example produces no value. Here are some ways that other br constructs could be encoded in LES:

br exit;                     // unconditional branch, no value
br exit($value);             // unconditional branch with value
br exit 'if condition;         // conditional branch, no value
br exit($value) 'if condition; // conditional branch with value
// Branch tables (syntax is designed to put index after value)
br         'table $index [a, b, c] | d; // no value
br($value) 'table $index [a, b, c] | d; // with value

This assumes you can use AST notation. In case AST notation is not possible because of a “void” value between the opcode and its values, alternate syntax(es) are needed. My current idea is to have a pseudo-opcode pop and pop N that represent one or more values to pop. Then one could write, for instance,

$result_value();
$condition();
drop $do_something_else();
br exit(pop) 'if pop;

Instead of (or in addition to) a “friendly” branch syntax, I think there should be a “basic” form, at least one that omits all stack elements:

br_if exit;
br_table(1 /*arity*/, [a, b, c], d /*default branch target*/);

Currently, operators that are 'words cannot be used as prefix or suffix operators, e.g. 'drop $foo() is illegal. This restriction leaves the door open to having ' as a prefix for arbitrary continuators in keyword-expressions (to avoid ambiguity between unary 'word-ops and continuators).

Attributes

“Attributes” are nodes that are independently attached to other nodes. Any node in the syntax tree (whether it’s an identifier, literal or call) can have associated attributes. Attributes can be “out-of-band” metadata such as debug information and comments, as well as “attributes” in the C# sense or “annotations” in the Java sense.

Currently, attributes can only appear at the very beginning of an expression and they apply to the whole expression. They consist of an @ sign followed by a “particle” (identifier, literal, braces, brackets or parentheses). The following example shows the kinds of attributes you can write:

    // Identifier as attribute
    @Mathy .fn Square(x: i32) { x * x; }
    // Literal as attribute
    @"Increment" $x := $x + 1;
    // Braces or list as attribute
    @{a;b;c} set_local($x, 1);
    @[a,b,c] set_local($x, 1);
    // Arbitrary expression as attribute
    @(f(x, y)) ($x = 1);
    // Multiple attributes on a single expression
    @123 @0x123 @Numeric 
    $x = 1 + $x;

Currently, in order to apply an attribute to a child node, parentheses are required. Inside parentheses, a new expression is considered to begin:

   $x = (@plus (@why $y) + (@zed $z));

Note: Java annotations can have an argument list, e.g. @Foo(123). In LES this is not allowed, because (unless we pay attention to whitespace) there would be an ambiguity between @Foo(123) (f())() (Foo has an argument list) and @Foo (f())() (no argument list intended). You must write @(Foo(123)) instead.

Custom Literals

This isn’t especially important for Wasm, but I’d like to mention my plan for literals. As a format for future programming languages, LES should have a mechanism to allow new kinds of literals to be defined. Meanwhile, LES parsers that don’t support a given kind of literal should be able to round-trip it from text to text.

In addition to true, false, null and character literals, LES has three other literal syntaxes:

  1. Numbers with optional type suffix, e.g. -1234, 0x1234_5678_9ABCi64
  2. Strings with optional type prefix, e.g. "Hello, world!", s"symbol", re"regex"
  3. @@ literals, e.g. @@nan.f, which is intended for named literals and includes booleans like @@false and inifinities like @@inf.d. These literals are parsed the same way as the single-quoted operators introduced above.

LES unifies these three kinds of literals into a single concept. When scanning any of these literals, an LES parser can store the literal in a type like this:

// ES6
class CustomLiteral {
    // type prefix or suffix
    public typeMarker;
    // An LES parser is allowed to keep all values in string form, 
    // but if the type marker is known to be numeric, it is acceptable
    // to parse it into a number and store that here instead.
    public value;
};

A custom literal need not keep track of whether the literal was originally written as a number or as a string, because numeric and string literals are equivalent. The string u"0x12345" is an acceptable way to express the number 0x12345u, while the number number 1234.5 is equivalent to the string number"1234.5" (i.e. number is the default suffix for numbers if one is not present). Finally, an unknown @@ literal like @@hello-world! can be expressed as the string `@@`"hello-world!" (a type prefix/suffix can be any identifier including a backquoted one; in this case, the type prefix is the backquoted identifier @@.)

While the printer is always allowed to print a literal back out in the form of a string, for readability it is recommended to print it as a numeric literal if the literal is “known” to be a number (e.g. the type marker is recognized as numeric or the LES implementation tracks number-ness. Just be careful not make invalid output by miscategorizing a string as a number; I can imagine security risks from such a mistake.)

Proposed literal type markers (not necessarily supported by Wasm):

Ordinary strings do not have a type marker, but we could reserve the empty type marker `` for strings.

Other ideas:

If the parser recognizes the type marker but the value fails to parse into that type (e.g. 0xFFFF0000i32, which overflows), the parser may print an error and should store it as a string in a CustomLiteral object.

Room to grow

The following syntactic elements are unused, allowing potential future use:

Plus:

Miscellaneous issues

Conclusion

I’m finishing up my parser and unit tests for LESv3 in C#. Before writing/porting parsers for other langauges I’d like to settle some of the open questions:

Questions for the Community

Take the survey!

Take the survey!

Update: 2016-09-11

Based on the survey results,

In other news,

Note: Normally, a newline is allowed after a binary operator. However, in order to decrease the chance of the parser accepting code with a mistake in it, like

     x = y + z z
     y = z

A newline will not be allowed after one of these word-operators and will cause a syntax error. Also, since word operators allow almost any sequence of names and punctuation to parse as a valid expression, as a default lint, the parser could print a warning if a single source file uses the same symbol as an identifier and also as an operator.

Comments