LES: Loyc Expression Syntax

Your DSL buddy & universal syntax tree

LES is an interchange format for syntax trees, comparable to s-expressions but designed for languages in the Algol family such as C, C++, C#, Java, EcmaScript, Rust and Python. It can be described as “JSON for code”: just as XML/YAML/JSON are tree structures that assign no particular meaning to the data inside, likewise LES represents syntax trees without assigning any particular meaning to them.

LES is a C-like language with {braced blocks} and expressions that end in semicolons; its parser is much simpler than C itself. The output of the LES parser is a list of expressions, in the form of tree structures called Loyc trees. In comparison with the syntax trees inside most compilers, Loyc trees are designed to have an almost LISP-like simplicity.

Here is an example of LES code:

  @[#static] 
  fn factorial(x::int)::int {
    var result = 1;
    for (; x > 1; x--) {
      result *= x;
    };
    return result;
  };

The above code uses a feature called “superexpressions” to make it more readable. The same code can be written without superexpressions as

  @[#static]
  fn(factorial(x::int)::int, {
    var(result = 1);
    for((; x > 1; x--), { result *= x });
    return(result);
  });

Or it can be written in “pure prefix notation”, which uses @identifiers-with-at-signs instead of infix operators:

  @[#static]
  fn(@'::(factorial(@'::(x,int)),int), @`'{}`(
    var(@'=(result, 1)),
    for(#tuple(@``, @'>(x, 1), @'--suf(x)), @`'{}`(@'*=(result, x)));
    return(result);
  ));

The last form most closely reflects the “real” structure of the code. After reading more about LES and Loyc trees, you will understand why these three forms are equivalent. If you put this code into LeMP.exe along with the following directive:

#importMacros(LeMP.Prelude.Les);

It will be transformed into the following C# code:

  static int factorial(int x)
  {
    var result = 1;
    for (; x > 1; x--)
      result *= x;
      return result;
  }

The C# code, in turn, looks like this when printed as LES code:

  @[#static] #fn(#int32, factorial, #(#int32 `#var` x), {
    #var(@``, result = 1);
    #for(@``, x > 1, x --, result *= x);
    #return(result);
  });

Note: A new LES is coming

This document describes LES version 2, but this will be replaced fairly soon. LES version 3 has been in progress for some time, but as of 2020/05 I finally have a solid block of time to dedicate to it. I’ve decided not to standardize the original LES3 as planned; instead, I intend to split LES3 into two versions, LESC (LES C-style) and LESP (LES Python-style) which will be very similar to each other but treat colons, newlines, and spaces differently. LESC will be simpler than LESP. LESC, like LES2, will be a superset of JSON.

LES2 and LES3 are both available in the C# NuGet package Loyc.Syntax.dll.

LES for configuration files

As a superset of JSON, LES is useful for configuration files. LES is more compact and flexible than JSON and XML. Compare these representations of the same data:

<?xml version="1.0"?>
<root>
  <owner name="John Smith">
    <address city="Calgary" prov="Alberta" country="CA", address="123 Fake St"/>
    <vehicle make="Ford" model="Pinto" year="1978" insurerId="1"/>
    <vehicle make="Toyota" model="Prius" year="2014" insurerId="1"/>
  </owner>
  <insurer name="Vehicle Insurance Kings" id="1">
    <address city="Calgary" prov="Alberta" country="CA", address="321 True Ave"/>
    <phone type="unknown">(403) 555-1234</phone>
    <phone type="tollfree">(800) 573-8941</phone>
  </insurer>
</root>
{
  "owners": [{
    "name": "John Smith",
    "addresses": [
      { "city": "Calgary", "prov": "Alberta", "country": "CA", "address": "123 Fake St" }
    ],
    "vehicles": [
      ["Ford", "Pinto", 1978, {"insurerId": 1}]
      ["Toyota", "Prius", 2014, {"insurerId": 1}]
    ]
  }],
  "insurers": [
    { "name": "Vehicle Insurance Kings", "id": "1", 
      "addresses": [
        { "city": "Calgary", "prov": "Alberta", "country": "CA", "address": "321 True Ave" }
      ],
      "phones": [
        { "type": "unknown",  "number": "(403) 555-1234" },
        { "type": "tollfree", "number": "(800) 573-8941" }
      ],
    }
  ]
}
// One of several possible choices of LES syntax
owner "John Smith" {
    address(city="Calgary", prov="Alberta", country="CA", address="123 Fake St");
    vehicle("Ford", "Pinto", 1978, insurerId=1);
    vehicle("Toyota", "Prius", 2014, insurerId=1);
};
insurer "Vehicle Insurance Kings"[id=1] {
    address(city="Calgary", prov="Alberta", country="CA", address="321 True Ave");
    phone("(403) 555-2341", type=unknown);
    phone("(800) 573-8941", type=tollfree);
};

LES works even better for configuration files that contain data that is “code-like” (e.g. conditions, queries, and commands). XML and JSON often require separate parsers for code-like data; for example,

<if condition="$Folder.Length &lt;= 255">...</if> 

is not only ugly compared to the LES equivalent

if $Folder.Length <= 255 {...};

but you’d need a separate expression parser, too.

Design goals

Grammars

A formal grammar of LES has not yet been published. The actual grammar used for parsing is written in LES and then converted to C# by LeMP. The grammar of LESv3 is a bit more readable. Meanwhile, this document defines LESv2 in an informal manner.

Comments & stuff

Comments in LES work like C#, C++, Java, and many other languages:

    /* This is a multi-line comment.
       It continues until there's a star and a slash. */
    /* A multi-line comment doesn't need multiple lines, of course. */
    profit = revenue - /* a comment inside an expression */ expenses;

    // This is a single-line comment. It has no semantic meaning.
    // It ends automatically at the end of a line, so you need
    // more slashes to continue commenting on the next line.

    /* Multi-line comments can be /* nested */ as you see here. */

Comments can be embedded in Loyc trees, and the LES parser can include comments and newlines in its output tree (using a mostly language-independent algorithm in StandardTriviaInjector).

Statements & expressions

An LES document is a sequence of “statements” separated, or terminated, by semicolons. In LES, unlike many other languages, there is no distinction between “statements” and “expressions”. However, there is a distinction between “top-level expressions” and “subexpressions”. Also, in some contexts, expressions are separated by semicolons, while in others they are separated by commas. The rules are fairly predictable for C programmers.

By convention, we may say that an expression is a “statement” if it appears at the top level or within a {braced block}, where it must end with a semicolon. It’s a stylistic issue only; it’s like in English you can say either

Bob made a sandwich; Alice ate it.

or

Bob made a sandwich, and alice ate it.

First, let’s discuss “normal” (sub)expressions. After that we’ll discuss top-level statements, which allow some additional syntax.

Subexpressions

LES supports the following kinds of subexpressions:

Here is a typical LES subexpression, showing that braced blocks can be embedded in expressions:

function(first_argument,
    second_argument,
    third_argument + {
        substatement();
        another_one(x, y, z)
    }
)

You may occasionally see the term “Prefix notation”. Prefix notation is a subset of the expression grammar that includes only identifiers (including special ones), literals, calls, and attributes (introduced below). Any Loyc tree can be expressed in prefix notation; for example, p = A!B[X,Y] can be expressed in prefix notation as

@'=(p, @`'_[]`(#of(A, B), X, Y))

and System.Console.WriteLine("Hello") can be expressed as

@'.(@'.(System, Console), WriteLine)("Hello") 

which is parsed like (@'.(@'.(System, Console), WriteLine))("Hello").

Top-level expressions

An LES file consists of a list of statements (i.e. top-level expressions) that are separated or terminated by semicolons (;), as if the file were enclosed in curly braces. It would be possible to allow either ; or , as a separator, but requiring semicolons allows the parser to catch certain mistakes earlier, without relying on indentation or newlines to choose error messages. For example, without this rule, the error on line 1 would not be detected until the end:

foo(2 * (x = y + z), bar   // line 1
if happy { 
  Smile(); 
} else { 
  Frown();
};

Every list of expressions in LES is a list of “top-level” expressions (whether it’s an argument list, a tuple, a braced block or the file as a whole). A top-level expression is special: it can have attributes attached to it, and can use the “superexpression” notation.

Attributes

avgSpeed = (@[miles] 0.25) / (@[seconds] 2.68);

Every node in a Loyc tree conceptually has an attribute list, which is a “side channel” for storing additional information. Every top-level expression can begin with a single list of attributes enclosed in @[square brackets]; the contents of the square brackets are parsed the same way as an argument list. For example, in

@[Foo(), X + 1] ++X;

Two attributes, Foo() and X + 1, are attached to the prefix expression ++X.

After the optional attribute list, each statement has either

  1. A normal subexpression (see the previous section), or
  2. A superexpression (see the next section)

Superexpressions

We might summarize the grammar of a superexpression as follows (in LLLPG’s ANTLR-style notation):

SuperExpr: TT.ID Expr Particle*;

In other words, it starts with an identifier, followed an expression, followed by zero or more “Particles”, where a Particle is an identifier, or something in braces, or something in parentheses that is preceded by a space, or a literal.

For example, this LES code contains six superexpressions:

function length(s) {
    if s == null {
      return 0;
    } else {
      var len;
      for (len = 0; s[len] != '\0'; len++) {};
      return len;
    };
};

Now that you know what a superexpression looks like, let’s talk about what it means. Quite simply, superexpressions are translated into ordinary calls:

function(length(s), {
   if(s == null, {
      return(0);
   }, else, {
      var(len);
      for((len = 0; s[len] != '\0'; len++), {});
         return(len);
      });
   };
};

Remember that a superexpression is always a part of a list of expressions, so a superexpression is always terminated by ;, ,, EOF, or one of the closers ), ] or }. The last part of a superexpression does not allow “full” expressions, in order to maximize the chances of getting a syntax error if you forget a semicolon after }.

This is the key tradeoff of LES. C allows you to write

while (false) {}
if (c) {} else {}
x++;

with no semicolon after any of the closing braces, but LES requires an explicit signal that each statement is terminating. The above code will only give a parse error when it reaches ++, since operators are not allowed after the first braced block. The correct LES code is

while (false) {};
if (c) {} else {};
x++;

Since LES is a language with no keywords, it’s important to understand how the parser can distinguish superexpressions from subexpressions. Often it’s obvious: return 0 is not a valid subexpression, so it must be a superexpression.

But there is an important ambiguity. How can LES distinguish a normal expression like

WriteLine("Hello, world!");

from a superexpression like

WriteLine ("Hello, world!"); //?

The answer is very simple: if '(' is preceded by a space or tab (' '|'\t'), it is not allowed to represent a function call. For example, both of these statements are syntax errors (the parser will tell you what you did wrong):

// Syntax error. If a function call was intended, remove the space(s) before '('.
Console.WriteLine ("Hello, world!");
// Tuples require ';' as a separator. If a function call was intended, remove 
// the space(s) before '('.
Foo (X, Y);

This is based on the observation that in C/C++/C#/Java, most programmers write code like

Console.WriteLine(...);
var re = new Regex(...);
F(x);

with no space between the identifier and the method being called, but many (albeit fewer) programmers write

if (...) {...} else {...}
for (...) {...}
lock (...) {...}
while (...) {...}
switch (...) {...}

with a space between the keyword and the opening paren. So in a language that has no keywords, it seems reasonable, although not ideal, to use a spacing rule to identify whether the first word of an expression is to be treated as a “keyword” or not. Since the compiler will, in most cases, print an error when the spacing rule is not followed, anyone programming in LES will get into the habit of writing the space, or not writing a space, under the correct circumstances.

Note that both of these statements are legal and have virtually the same meaning:

Foo(x);  // Subexpression
Foo (x); // Superexpression

The only difference is that, in the second case, x is considered to be inside parentheses (i.e. Foo (x) is equivalent to Foo((x)), see ‘Subtleties’).

Forgotten semicolons

LES was actually redesigned to reduce the space of valid superexpressions, so that in most cases, the parser would give a syntax error if there is a missing semicolon after closing braces. For example, a statement like minOrMax = if (max) Max() else Min(); used to be legal, and no longer is, which allows mistakes like the following to be detected without analyzing newlines or indentation:

while (x < 100) { x *= 2; }
Foo(x); // Syntax error

do x++ while (Foo(x))
x++; // Syntax error

if (c) { c.F(); }
var x = 0; // Syntax error

if (c) { c.F(); }
if x > 0 { return 0 }; // Syntax error

Even so, sometimes when two superexpressions are placed side-by-side, the parser doesn’t see a problem:

while (x < 100) { x *= 2; }
return x; // No parse error

There are two main ways to mitigate this lack-of-an-error:

1. Front-end processing: when you write LES code you have some kind of reason to do so - usually you’re going to feed the code into a compiler or other tool (e.g. LeMP) after you’ve written it. Also, superexpressions are designed for a specific purpose: to simulate constructs like while loops and function definitions.

That means that whatever software receives the Loyc tree produced by the LES parser knows it is expecting to see a while loop with two arguments, an if statement with 2 or 4 arguments (if(expr, {...}) or if(expr, {...}, else, {...})), and so forth.

The symptom of a missing semicolon is a call that receives too many arguments. Therefore, the software that receives a superexpression with too many arguments can assume, if the earlier arguments look reasonable, that a missing comma or semicolon is to blame, and report something like “expected end of statement” at the first unexpected argument:

    while (c) { 
      c = F(); 
    }
    return 0; 
//  ^^^^^^ while loop: expected end-of-statement (did you forget a ';'?)

2. Syntax highlighting: coloring can visually distinguish superexpressions from normal expressions and function calls (supported in Visual Studio):

In this example there are three missing semicolons, but only one of them is a syntax error (you’ll get an error after the strlen function). The other two—after the for loop and after the while loop—are not errors, but they cause the return “keyword” to be shown in a different color (in this case, black rather than blue).

Precedence rules

TODO: update this section

The operator precedence rules are documented at LesPrecedence.cs.

In short, the characters in the operator determine the operator’s precedence, starting from the following basis rules:

See the full documentation for the rules about binary => ~ <>, `backtick`, and prefix / \ < > ? =.

The infix operators >>, <<, &, | and ^ are not allowed to be mixed with certain other operators, because their precedence is a potential source of bugs. For example, x & 7 == 0 is considered a syntax error, although the parser will still produce a tree from it (@'&(x, @'==(7, 0))). Backquoted operators have a similar restriction (for full details, see “Precedence”).

Literals

LES supports the following kinds of literals:

There is no literal notation for lists or tuples (note that a tuple such as (1; "two") is not a “literal”; it is a Loyc tree that some programming languages may or may not interpret as a tuple.) More literal types will probably be added in the future (such as unlimited-size integers, 8-bit and 16-bit ints, byte array literals).

A few languages, including Ruby and LISP, have a symbol data type; symbols are singleton strings, and they are implemented by the Symbol class in Loyc.Essentials. Symbols have a performance advantage over strings in many cases. The advantage of symbols is that you never have to compare the contents of two Symbols. Since each Symbol is unique, two Symbol references are equal if and only if they point to the same heap object. Loyc.Essentials allows derived classes of Symbol and “private” symbol pools, but LES supports only global symbols. Symbols are a useful alternative to enums.

All three of the syntaxes for strings produce the same data type (System.String), and the two kinds of triple-quoted strings are otherwise identical. Double-quoted strings like "this one" must be entirely written on a single line, while triple-quoted strings can span an unlimited number of lines.

Double-quoted strings support escape sequences; the syntax is the same as for C and C#:

I hate the idea of source code changing its meaning when a text editor switches between UNIX (\n), Windows (\r\n), and Macintosh (\r) line endings. Therefore, when a triple-quoted string spans multiple lines, each newline will always be treated as a \n character, regardless of the actual byte(s) that represent the newline in the source file.

In addition, after a newline, indentation is ignored insofar as it matches the indentation of the first line (the line with the opening triple-quote). For example, in

    namespace Frobulator {
       def PrintHelp() {
          Console.WriteLine("""
            Usage: Frobulate <filename> <options>...
            Options:
              --blobify: Blobifies the frobulator.
              --help: Show this text again.""");
       };
    };

The words “Usage” and “Options” are indented by two spaces. This is compatible with dot-space notation, so the following version is exactly equivalent:

    namespace Frobulator {
    .  def PrintHelp() {
    .  .  Console.WriteLine("""
    .  .    Usage: Frobulate <filename> <options>...
    .  .    Options:
    .  .      --blobify: Blobifies the frobulator.
    .  .      --help: Show this text again.""");
    .  };
    };

I decided that I wanted all the string types to support any string whatsoever. In particular, I wanted it to be possible to store "'''\r\"\"\"" as a triple-quoted string. Therefore, I decided to support escape sequences inside triple-quoted strings, but the sequences have a different format (with an additional forward slash). The following escape sequences are supported:

Then, to write an escape sequence literally, you must escape the initial backslash, e.g. '''\\/r/''' represents the three characters \r/, and '''\\/\/''' represents \\/. If the backslash and forward slash are not recognized as part of an escape sequence, they are left alone (e.g. '''\Q/''' is the same string as '''\\/Q/''').

Triple-quoted strings do not support unicode escape sequences. Please use the character itself, and note that the recommended encoding for LES files is UTF-8, with or without a “BOM”.

Subtleties

Parentheses & Style

If an expression uses parentheses for grouping, this fact is encoded in the tree by attaching a %inParens attribute, unless there is an attribute marker after the opening (. For example, the parser produces the same result for 2 * (x+1) and 2 * (@[@%inParens] x + 1). Similarly, the parser produces the same result for 2 * x + 1 and (@[] 2 * x) + 1. Note: the prefix % was #trivia_ in older code.

The C# Loyc tree implementation reserves 8 “style” bits, intended for tracking other style distinctions such as the difference between 11 and 0xB, and the difference between 2 + 2 and @'+(2, 2). These bits are the only mutable state in an LNode.

Note: it has been proposed that the style bits be removed to make the Loyc tree concept simpler. The flyweight pattern could be used instead, by using cached sets of one or more attribute to represent styles. For example, the number 0xF might be stored in memory with an attribute like @[@%base16]15, and the attribute list @[@%base16] could be shared between all hexadecimal numbers to save memory.

Style bits also indicate whether commas or semicolons are used as a separator/terminator in argument lists, braced blocks and lists in square brackets. The current implementation sets BaseStyle = NodeStyle.Statement if and only if semicolons are used.

Comma vs semicolon

There is a slight semantic difference between , and ;; a comma is always treated as a separator, while a semicolon is a separator or terminator. That is, { A; B } and { A; B; } both count as two statements. In contrast, while F(A, B) counts as two arguments, F(A, B,) counts as three (the third one is missing, and @`` is inserted as a placeholder in the syntax tree). And while ‘,’ is almost always used to separate arguments (parameters) in an argument list, ; is also allowed, in case the programmer has a stylistic reason to want to use it; F(A; B;) counts as two arguments, not three. Semicolons and commas cannot be mixed in the same context; F(A; B, C) is an error.

Why semicolons in tuples?

There are two reasons why elements of a tuple must be separated by semicolons instead of commas:

  1. The main reason is to make Foo (X, Y) into a syntax error, which prevents a programmer from accidentally treating Foo as a one-argument function.
  2. Note that Foo(x,) is considered to be a function of two arguments, but Foo(x;) is considered to be a function of one argument, because comma is a separator while semicolon is a terminator. We need a way to represent tuples of one argument. We could require one-arg tuples to be explicit (#tuple(x)), but some languages (e.g. Python) offer the syntax (x,) for this purpose. However, according the logic above, (x,) should be a tuple of two items in LES, while (x;) is more logically understood as a tuple of size one.

Grammar classification

The grammar of superexpressions overlaps that of subexpressions, because many infix operators can also be prefix operators; in case of ambiguity, normal expressions win. For example, X * Y could be parsed as a superexpression meaning X(*Y), but this does not happen, because normal expressions take priority.

The subexpression grammar of LES is LL(1), if superexpressions are omitted, except that additional error checks are required. LES with superexpressions is “augmented” LL(2). The grammar must be augmented with additional logic because a plain BNF grammar is ambiguous:

The spacing rule

Consider a superexpression like

Foo (x);

The original definition of LES decided whether this was a superexpression or not by comparing the character index of the end of the expression Foo with the character index of (. The new version of LES uses a slightly simpler rule: checking whether ( is preceded by a space or tab. This can be accomplished with a lexer rule that uses one token type for "(" and a different one for " (". The new rule gives a different interpretation in unusual cases like

Foo/* !!! */(x); // This is now considered a function call, not a superexpression

Currently a newline '\n' does not count as a space; it’s not clear whether this matters.

Just so we’re clear, LES has two spacing rules, usually represented by defining separate tokens for "(" and " (":

  1. A call cannot use " (" - remove the space before the left parenthesis.
  2. The particles in a superexpression cannot use "(" - add a space before the left parenthesis.

There are no special spacing rules that involve '[' or '{'.

Prefix & suffix precedence

TODO: explain how x * | y + z parses as x * (| (y + z)), examine suffix operator behavior.

JSON compatibility

LESv2 was made JSON-compatible via four changes to LES:

  1. Allow ',' as a separator inside {braces}
  2. Introduce [JSON, style, lists]
  3. Use @[...] instead of [...] for attributes (token literals were changed from @[...] to @{...}; note that using @(...) for attributes would not have clashed with token literals, but I felt it would be better to use @[...] because [] does not require the shift key, and token literals are a more obscure feature.)
  4. Do not give an error message for big integers (larger than can fit in 64 bits) (incomplete)

Although LES supports JSON syntactically, the structure of the tree produced by the LES parser is different from that produced by a JSON parser. In particular, LES has no concept of “dictionaries” or “objects”, so a “dictionary” like {"a":"A", "b":"B"} really means @`'{}`(@':("a","A"), @':("b","B")), i.e. each pair is represented by a call to the binary colon operator.

Shebangs

On Linux/BSD systems, a “shebang” is the two characters #! at the beginning of a file. These characters tell the operating system how to run a script. The LES parser ignores the first line of the file if that line starts with #!.

Using LES in .NET

To use LES in .NET, simply call LesLanguageService.Value (in Loyc.Syntax.dll) to get an IParsingService object that supports parsing LES text and printing Loyc trees as LES text. Call Print(node) to print and Parse(text) to parse.

Example:

IListSource<LNode> code = LesLanguageService.Value.Parse("Jump(); Ship();");
LNode firstStatement = code[0];
string code2 = LesLanguageService.Value.PrintMultiple(code);   // "Jump(); Ship();"
string first = LesLanguageService.Value.Print(firstStatement); // "Jump();"

You can also call Tokenize("text") to use the lexer by itself (it implements ILexer<Token> and IEnumerator<Token>; just call NextToken() which returns Maybe<Token>).

The parser returns a list of Loyc trees (IListSource<LNode>). See “Using Loyc trees in .NET”.