Adding "quick binding" to Enhanced C#, part 1

27 Mar 2016

The Common Subexpression Problem

Many years ago I noticed a common pattern with my “if” statements. I would use a complex or computed expression of some sort - usually a method or property call - and then, soon afterward, I would want to use the same expression again. For example:

if (list[i].SubItems.Count > _threshold)
   Foo(list[i].SubItems);
else {
   ...
}

When I’m coding, this pattern seems to happen multiple times a day. If I don’t want to incur the cost of evaluating the expression twice (keeping in mind that in .NET, the optimizer doesn’t factor out the common subexpression as reliably as you might like), I have to rewrite it as

var subitems = list[i].SubItems;
if (subitems.Count > _threshold)
   Foo(subitems);
else {
   ...
}

This takes time. Plus, in my opinion, the new code is not as readable as the original code. The readability problem in this particular case is quite small, but it tends to grow as the complexity of the “if” condition grows (you’ll see that soon).

The necessary refactoring is:

select the subexpression,
cut it to the clipboard,
choose a variable name (subitems) and write it in place of the original expression,
insert a new line, and
add var subitems = and paste the subexpression.

Sometimes it’s a lot more complicated, though. Consider a slightly modified version of the original code:

if (i < list.Count && list[i].SubItems.Count > _threshold)
   Foo(list[i].SubItems);
else {
   ...
}

The instructions above would give us code that is clearly wrong:

var subitems = list[i].SubItems; // ArgumentOutOfRangeException!!!
if (i < list.Count && subitems.Count > _threshold)
   Foo(subitems);
else {
   ...
}

Instead we have to “move up” the test for i < list.Count. Something like this:

if (i < list.Count) {
   var subitems = list[i].SubItems;
   if (subitems.Count > _threshold)
      Foo(subitems);
   else {
      ...
   }
}

This is still wrong, though. See the problem? It’s the else clause. The else clause is supposed to run if either of the if conditions is false, but now it is skipped entirely when i < list.Count is false. So now we have to refactor it again… maybe something like this:

SubItemType subitems;
bool flag = i < list.Count;
if (flag) {
   subitems = list[i].SubItems;
   flag = subitems.Count > _threshold;
}
if (flag)
   Foo(subitems);
else {
   ...
}

Wow! That’s ugly. Sometimes we have to go to great lengths just to factor out a common subexpression. And there’s a compiler error hidden in this refactoring - can you spot it? The anti-readability of this code is obvious, since one line of code has ballooned to 7.
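For readers who want to try it, here is a minimal compiling sketch of the flag-based refactoring. The Item class, the Check method, and the sample data are hypothetical stand-ins that do not appear in the original code. The only change to the logic is that subitems is initialized to null, which is one way to satisfy C#'s definite-assignment rule: the compiler cannot prove that the second test of flag implies the assignment happened, so it reports CS0165 (use of unassigned local variable).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the element type of `list`.
class Item
{
    public List<string> SubItems = new List<string>();
}

static class FlagRefactoring
{
    const int _threshold = 1;

    // Returns "foo:<count>" where Foo(subitems) would run,
    // and "else" where the else clause would run.
    public static string Check(List<Item> list, int i)
    {
        // Initialized to null only to satisfy definite assignment (CS0165).
        List<string> subitems = null;
        bool flag = i < list.Count;
        if (flag) {
            subitems = list[i].SubItems;
            flag = subitems.Count > _threshold;
        }
        if (flag)
            return "foo:" + subitems.Count;
        else
            return "else";
    }

    static void Main()
    {
        var list = new List<Item> { new Item(), new Item() };
        list[1].SubItems.AddRange(new[] { "a", "b" });
        Console.WriteLine(Check(list, 0)); // else (count is not above the threshold)
        Console.WriteLine(Check(list, 1)); // foo:2
        Console.WriteLine(Check(list, 5)); // else (index out of range)
    }
}
```

Note that the out-of-range index now reaches the else branch, which is what the nested-if version got wrong.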

As a real life example, consider this code, which I managed to find in a matter of seconds when I started looking through my own code:

static Symbol ChooseFieldName(Symbol propName)
{
   string name = propName.Name;
   char first = name.FirstOrDefault();
   char lower = char.ToLowerInvariant(first);
   if (lower != first)
      name = lower + name.Substring(1);
   return GSymbol.Get("_" + name);
}

In this function I’ve explicitly factored out common subexpressions into local variables. If I hadn’t done that, the code would have looked like this:

static Symbol ChooseFieldName(Symbol propName)
{
   string name = propName.Name;
   if (char.ToLowerInvariant(name.FirstOrDefault()) != name.FirstOrDefault())
      name = char.ToLowerInvariant(name.FirstOrDefault()) + name.Substring(1);
   return GSymbol.Get("_" + name);
}

Even though this code is longer and will run slower, it is, to me at least, slightly easier to understand. I think that the variable declarations for first and lower are a cognitive burden for the reader because some of the context information has been removed: when you look at first, you have no idea what its purpose is. The if statement is what gives a purpose to this variable, but the if statement doesn’t appear until two lines later. Therefore, it takes you longer to see that the code converts the first letter of name to lowercase (but only if the first letter changes as a result).

Solution

Some languages, such as Go, have a := operator that creates and assigns a variable at once. In C#, the natural equivalent of := would be

if ((var subitems = list[i].SubItems).Count > _threshold)
   Foo(subitems);
else {
   ...
}

But with the extra parentheses and everything, it is a little unwieldy. Several years ago I found the optimal solution to this problem. It looks like this:

if (list[i].SubItems::subitems.Count > _threshold)
   Foo(subitems);
else {
   ...
}

I call :: the “quick binding operator”. The :: operator already exists in C#, but you could code for years without ever using it; I’m just proposing that we “overload” this operator with a new behavior, whenever its original behavior does not apply.

Using :: is not just shorter, it has a better workflow, too. As soon as you write Foo, you notice that you’re using the same expression again. So you simply go to the line above and add ::subitems to save it to a temporary variable. No cutting and pasting, and the name “subitems” appears only twice, not three times as in the manually refactored code. Plus, the example above that uses && stays simple:

if (i < list.Count && list[i].SubItems::subitems.Count > _threshold)
   Foo(subitems);
else {
   ...
}
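Readers without EC# can experiment with the same idea in standard C# 7.0 or later, where a var pattern binds a value in the middle of an expression. The sketch below is only an approximation for trying out the workflow, not how EC# implements ::, and the Node class, the Check method, and the sample data are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the element type of `list`.
class Node
{
    public List<string> SubItems = new List<string>();
}

static class VarPatternDemo
{
    const int _threshold = 1;

    // Returns "foo:<count>" where Foo(subitems) would run, otherwise "else".
    public static string Check(List<Node> list, int i)
    {
        // `is var subitems` always matches and evaluates list[i].SubItems once,
        // playing the role of `list[i].SubItems::subitems` in the proposal.
        // Because of &&, the indexer only runs when i is in range.
        if (i < list.Count && list[i].SubItems is var subitems && subitems.Count > _threshold)
            return "foo:" + subitems.Count;
        else
            return "else";
    }

    static void Main()
    {
        var list = new List<Node> { new Node(), new Node() };
        list[1].SubItems.AddRange(new[] { "a", "b" });
        Console.WriteLine(Check(list, 0)); // else
        Console.WriteLine(Check(list, 1)); // foo:2
        Console.WriteLine(Check(list, 5)); // else
    }
}
```

The pattern form needs an extra && where the proposed operator would simply continue the expression with .Count, so it is somewhat wordier than ::.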

In my opinion, subitems should (in general) also be available in the else clause, and even after the end of the if statement. After all, if you were writing the code by hand, it would be, and it’s more generally useful if it’s accessible afterward. So, should it be scoped to the first part of the if statement, the entire if statement, or to the outer block? Well, this seems like a decision I can put off until later.

Here’s how it looks for the ChooseFieldName example above:

static Symbol ChooseFieldName(Symbol propName)
{
   string name = propName.Name;
   if (char.ToLowerInvariant(name.FirstOrDefault()::first)::lower != first)
      name = lower + name.Substring(1);
   return GSymbol.Get("_" + name);
}

Implementing this in EC#

The motivation to support this feature goes beyond simple variable declarations, and it goes beyond the code you write yourself. Consider the ?. operator. Today the ?. is part of C# 6, but before that it was implemented as a LeMP macro for Enhanced C#. When you wrote

Foo(Bar?.Baz);

a macro would convert this to

Foo(Bar != null ? Bar.Baz : null);

But of course, this translation could be wrong: Bar is evaluated twice, but it should only be evaluated once. Originally I planned to solve this using a feature called “block expressions”. The ?. macro would generate code like this:

Foo({ var Bar_13 = Bar; Bar_13 != null ? Bar_13.Baz : null });

Note: 13 would be a global counter used to produce unique variable names.

Notice that the final expression has no semicolon, which is used to mean “return a value from here”; it’s an idea copied directly from Rust. Originally I was going to use this same syntax to simplify function return values:

static double Square(double x) { x*x }

However, this syntax is a bit redundant now that C# 6 has lambda-style expression-bodied members:

static double Square(double x) => x*x;

So now I’m thinking it’s better to avoid complicating the parser with new syntax that we don’t really need. Instead I’m thinking of using the following syntax, which re-uses the in operator that is already used for other purposes:

Foo((Bar_13 != null ? Bar_13.Baz : null) in { var Bar_13 = Bar; });

The in {...} operator would be a bit like the where clause in Haskell (it’s just easier to call it in because in is a keyword and where is not). The second part (in the braces) runs first, and then the first expression produces the final result. Arguably it’s better to have a left-to-right execution order, so we could consider other syntaxes, like

Foo({ var Bar_13 = Bar; out Bar_13 != null ? Bar_13.Baz : null; });

Foo({ var Bar_13 = Bar; } => Bar_13 != null ? Bar_13.Baz : null);

It may look like these are “new” syntaxes, but they’re not. Both of them are re-using pre-existing parsing rules that other macros are already relying on.

All of these possibilities have an… issue. By using braces, the implication is that a new scope is created, so the variable Bar_13 should exist within that scope and disappear afterward. However, we need to access the value of Bar_13 after the scope has ended so that we can get its value out.

I had thought of dealing with this problem with “variable renaming”. The idea is: “let’s eliminate the braces so we can use Bar_13 in the call to Foo”. Rather than actually using braces to create a new scope, we’ll strip out the braces, but give each variable within the braces a new name so it doesn’t conflict with anything outside. C# doesn’t allow you to declare anything other than variables in an executable context, so renaming variables is sufficient (we need not watch out for methods and properties, for instance). However, in this particular case, Bar_13 is already a unique name because it’s a compiler-generated variable. This leads me to ask: hang on, do we really need the braces at all?

The braces provide an obvious syntax for saying “I want to execute a statement inside an expression”. However, end-users don’t really need this feature; it’s intended mainly as a mechanism to help macros work. So now I’m thinking, let’s forget the braces and just have a pseudo-function called #runSequence or something like that:

Foo(#runSequence(var Bar_13 = Bar, Bar_13 != null ? Bar_13.Baz : null));

Now, how is all this related to our :: quick binding operator?

Well, given a binding like this:

if (list[i].SubItems::subitems.Count > _threshold)

It can be rewritten as

if (#runSequence(var subitems = list[i].SubItems, subitems).Count > _threshold)

Therefore, it can be handled the same way as any other sequence of statements that a macro might produce.
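To picture what such a sequence could lower to in plain C#, one option is to hoist the declaration out of the expression and leave the assignment where the binding was written. This is only a sketch of one possible output under that assumption, not necessarily what EC# or LeMP would generate; the Entry class, the Check method, and the sample data are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the element type of `list`.
class Entry
{
    public List<string> SubItems = new List<string>();
}

static class HoistedBindingDemo
{
    const int _threshold = 1;

    // Returns "foo:<count>" where Foo(subitems) would run, otherwise "else".
    public static string Check(List<Entry> list, int i)
    {
        // The declaration is hoisted out of the expression...
        List<string> subitems;
        // ...while the assignment stays where `::subitems` was written, so the
        // short-circuiting && still guards the indexer. The compiler accepts the
        // use of subitems in the true branch because it is definitely assigned there.
        if (i < list.Count && (subitems = list[i].SubItems).Count > _threshold)
            return "foo:" + subitems.Count;
        else
            return "else";
    }

    static void Main()
    {
        var list = new List<Entry> { new Entry(), new Entry() };
        list[1].SubItems.AddRange(new[] { "a", "b" });
        Console.WriteLine(Check(list, 0)); // else
        Console.WriteLine(Check(list, 1)); // foo:2
        Console.WriteLine(Check(list, 5)); // else
    }
}
```

A lowering like this requires knowing the variable's type up front, since var cannot be used without an initializer, which hints at why a macro-level implementation is not trivial.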

Whatever syntax we use, actually implementing it will be challenging. More on that when I write part 2.

Comments at Reddit - feedback was lukewarm, so I’ll probably put this off a long time or only implement it partially.
