Introducing the New Lookbehind Feature!

If you liked LOOKAHEAD, then you’ll love LOOKBEHIND!

Actually, I have not added LOOKBEHIND as a new keyword in JavaCC. It is just an informal way of referring to a new feature that I recently added.

I implemented this because I realized that there was a very common thing people want to do in a JavaCC grammar that simply cannot be done in a simple, robust way.

It comes down to the fact that certain constructs are naturally re-entrant and others are not. I suppose some readers will recall that, in the very first versions of Java, you could not have a class with a class. (Other readers were not born at that time.) Regardless, you probably all know that there are now inner classes. So they changed their minds. However, there are still a host of things that are not re-entrant. For example, you cannot have a method declaration inside of another method declaration. Well, come to think of it, not directly. You can define an anonymous inner class inside of a method and then declare another method in that inner class.

So, anyway, if you do not want a grammatical production to nest recursively, how does one go about this with JavaCC? I guess a very typical way to deal with this would be to create a variable to keep track of whether we are already in a certain kind of grammatical production — Foo, let’s say…​

So, suppose we define a variable called inFoo. In JavaCC21, you would naturally use the INJECT feature:

INJECT PARSER_CLASS : {boolean inFoo;}

Then, in the Foo production, you set (and later unset) the variable. In legacy JavaCC syntax, this would look like:

void Foo() :
{
   inFoo = true;
}
{
   ...
   {inFoo = false;}
}

And this allows you to use "semantic lookahead" elsewhere, something like:

void Bar() :
{
   ...
}
{
   ...
   [LOOKAHEAD(Foo(), {!inFoo}) Foo()]
}

I provide the last two code snippets in legacy syntax, but moving on from here, just about any example I give will be in the newer streamlined syntax, in which the above snippets look more like:

Foo :
  {inFoo = true;}
  ...
  {inFoo = false;}
;

And now, as per here, the LOOKAHEAD is better written now as:

[SCAN {inFoo} => Foo]

Regardless, it is a rather cumbersome disposition. It requires you to have code in three different places — you inject the variable definition in your XXXParser class, you set and unset the variable in your Foo production, and then you use the variable elsewhere as "semantic lookahead".

Well, hey, at least it works!

Or does it???

Well, it might work…​ most of the time…​ You see, the damnedest thing about this is that it is not really very reliable. For example, if you use a deeply nested LOOKAHEAD from a point before the parser has even entered the Foo production, then inFoo variable, when it is checked within the lookahead routine has not been set!

So, here is how we could express the above with the new "lookbehind" feature:

[SCAN ~\...\Foo => Foo]

What we have above after SCAN is a little expression to express a condition (predicate more formally speaking) that we only enter the following expansion (Foo in this case) if, scanning backwards from where we are now, we do NOT (the ~ negates the lookbehind) find a Foo. In other words, the above "lookbehind" disallows recursive re-entry into Foo. (At least in this exact spot.) Note also, that aside from the advantage of only having to express this in ONE place instead of three, it works reliably from any deeply nested scanahead.

Lookbehind is already being used internally in JavaCC development. You can see an example here.

You can see there a typical use case. JavaCC’s LOOKAHEAD/SCAN instruction is a classic example of a construct that does not nest. You cannot (at least for now, never say never…​) define a LOOKAHEAD or SCAN statement when you are already in one. Thus, we have a lookbehind to disallow it.

SCAN ~\...\Lookahead
=>
la=Lookahead

Note that you can perfectly well combine lookbehind with other predicates such as syntactic and semantic lookahead. The order of the new SCAN directive is:

SCAN 
(optional integer to indicate maximum tokens to scan ahead) 
(optional java code expression to indicate so-called semantic lookahead)
(optional lookbehind expression) (optional syntactic lookahead)
=> Expansion

By the way, I have decided that lookbehind can only be used with the newer SCAN keyword. In general, any new lookahead/lookbehind sorts of features will (or any new features generally, come to think of it) will, from hereon, typically only be available if you use the newer syntax.

Syntax

The syntax of lookbehind expressions is fairly simple. If they start with a tilde (~) the expression is negated. If the expression starts with (leaving aside any tilde) a backslash (\) it means you are scanning from the current production up to the root. If it starts with a forward slash ('/') then it starts with the root production and goes in the other direction. So, if you were in Java code, let’s say, you could have a statement like:

SCAN \MethodDeclaration\ClassDeclaration\CompilationUnit/ => Expansion

This would mean that we only enter the expansion if we entered the root production (CompilationUnit by convention) and then entered ClassDeclaration and then entered a MethodDeclaration and then our current production (whatever it is) and now we are deciding whether to enter some Expansion.

The above could also be equivalently expressed going forwards as:

SCAN /CompilationUnit/ClassDeclaration/MethodDeclaration\

The final slash above means that the current grammatical production must have been invoked from a MethodDeclaration directly. No intervening grammatical productions. If we wanted to permit intervening productions, we could write:

SCAN /CompilationUnit/ClassDeclaration/MethodDeclaration\...\

But actually, if we just leave off the final bit, it means the same as the above, so you could typically write:

SCAN /CompilationUnit/ClassDeclaration/MethodDeclaration

for the same effect.

So, another example would be that if we didn’t want to allow a construct to be an an inner class, we would write something like:

SCAN ~\...\ClassDeclaration\...\ClassDeclaration =>

The …​ means that we can have an arbitrary number of intervening calls.

Addendum

I came about the idea for this feature in a rather serendipitous manner. The generated parser now maintains a call stack of the nonterminals you enter but I did not have this "lookbehind" feature in mind at all! I added that call stack to be able to generate better error messages. However, at some point, it occurred to me that the call stack could also be used for this feature, scanning back to see how we arrived at the current production and using this information to form lookbehind predicates!

It is possible to enhance the little mini-language here to make it more powerful, and that is likely in the future, but I did not want to get into speculative generality. If there is end user feedback expressing a need for the enhancement with genuine well motivated use cases, then rest assured. It will be enhanced.

One thing that is currently not implemented either with lookbehind or the conventional syntactic lookahead is the ability to combine more than one in a predicate. There is no obvious reason not to be able to chain more than one of them.

SCAN ~\...\MethodDeclaration & /CompilationUnit/.../InterfaceDeclaration

(There is actually no obvious reason not to be able to chain multiple syntactic lookahead expansions this way either!)

But that is currently not implemented. Again, if there is significant end-user demand for this (or I myself feel a need for it in internal JavaCC development) I suppose it will be added in short order! It is low-hanging fruit…​

Start the discussion at parsers.org