Nested Lookahead Redux

(The TLDR: Nested lookahead was always broken in legacy JavaCC. This is finally fully addressed (as of 11/26/2022) in JavaCC 21. However, since this fix has a lot of potential to break existing code, for now, the fix is only in effect if you put LEGACY_GLITCHY_LOOKAHEAD=false at the top of your grammar(s). Meanwhile, you should really check whether your grammars continue to work with that set.)

Better Late than Never!

I might as well motivate this discussion with a little example:

TREE_BUILDING_ENABLED=false;
BASE_NAME="Test";

INJECT TestParser :
{
    public static void main(String args[]) {
        new TestParser(System.in).Root();
    }
}

SKIP : " " ;

NestedChoice :
   SCAN 2 "foo" "bar" "baz"  
   |
   "foo"
;

Root :
    SCAN NestedChoice => {System.out.println("\nlookahead succeeded");} NestedChoice
    |
    {System.out.println("\nlookahead failed");}
;

The above is a self-contained example that you can build:

java -jar javacc-full.jar whatever-filename
javac TestParser.java

And you can test it from the command-line with:

java TestParser

At this point, you can input a line to pass into the parser. For example:

java TestParser
foo bar baz 

The above example was designed to illustrate a certain point. Consider the NestedChoice production. It contains a binary choice: it will consume the three tokens "foo", "bar" and then "baz" (that is the first choice) OR it will consume a single "foo" token, the second choice.

However, there is a key point to understand about the first choice. It specifies a 2-token lookahead, which indicates that it checks whether the next two tokens off the stream are "foo" and then "bar". It does not check for the "baz". If the next choice is not a "baz", it does not go to the next option, which is to consume a single "foo". It hits an error. The reason is that, once we have matched the expansion's predicate, which in this case, is the two tokens "foo" and "bar", we are committed to this branch and thus, we will not go to the next choice.

So, here is the summary of what the NestedChoice production can match:

  • foo bar baz
  • foo (and then something other than bar)

And here is what the production does not match:

  • Anything that does not start with foo
  • foo bar (followed by something other than baz)

So, again, if the next tokens off the stream are "foo", "bar", "foo", we are going to hit an error. It does not matter that the second option, the lone "foo" would have worked. Granted this could be confusing, but this is how the tool's logic works. If we wanted to check the first three tokens in the first option, we could have written:

 SCAN 3 "foo" "bar" "baz"

or:

 "foo" "bar" "baz" =>||

Now, let us consider the other production in the grammar, Root. It is also a binary choice. The first option starts with:

SCAN NestedChoice =>

which means that we are going to scan forward to see if a NestedChoice can be matched. Here is where the plot thickens. If we run our test harness

 java TestParser

and we feed it the input:

 foo bar foo

We get this output:

 lookahead succeeded
 Exception in thread "main" ParseException:
 Encountered an error at (or somewhere around) input:1:9
 Was expecting one of the following:
 BAZ
 Found string "foo" of type FOO
     at TestParser.handleUnexpectedTokenType(TestParser.java:512)
     at TestParser.consumeToken(TestParser.java:504)
     at TestParser.NestedChoice(TestParser.java:203)
     at TestParser.Root(Test.javacc:20)
     at TestParser.Root(TestParser.java:228)

This is clearly erroneous. As I explained above, the lookahead SCAN NestedChoice should have failed on this input! But it succeeded because the choice logic when scanning is not the same as when parsing. When you are parsing, the "foo" "bar" sequence must be followed by a "baz" for the production to successfully parse. However, when we are scanning ahead to check the predicate in the Root production, the lookahead succeeds. This is because, nested lookahead is ignored.

In any case, there is a real problem here. When a user writes:

SCAN NestedChoice => NestedChoice

it is completely reasonable for him to expect that if the lookahead succeeded, the parse will also succeed (N.B. There is the caveat here that we are talking about a situation where the expansion is purely syntactic. If the expansion contains conditions or actions expressed in Java code, one can easily envisage a situation where the lookahead succeeds but the parsing fails. Or vice versa. But that is really outside the scope of this post.)

Now, this is part of a general problem in which nested lookahead is simply not dealt with correctly in JavaCC. This is a longstanding problem. The biggest part of the problem (at least from a pragmatic viewpoint I daresay) was addressed in JavaCC 21 a couple of years ago. However, the issue discussed here remained.

The new LEGACY_GLITCHY_LOOKAHEAD setting

The issue has been dealt with in code. However, the fix is quite clearly backward-incompatible. Almost certainly, a high proportion of legacy grammars actually depend on this glitchy lookahead behavior. So, for now, I have left the ability to turn on the legacy behavior of ignoring nested lookaheads. (Just the non-syntactic ones to be precise.) In fact, for now, the legacy behavior is on by default. To turn it off, you have to use the new setting LEGACY_GLITCHY_LOOKAHEAD, i.e. put:

 LEGACY_GLITCHY_LOOKAHEAD=false;

up top of your grammar. When the upcoming transition to the Congo rebranding is done, this setting will be off by default. Thus, in terms of the sample grammar above, if you put the LEGACY_GLITCHY_LOOKAHEAD = false; up top and rebuild the parser, you will see that the example works as it really should.

This also affects zero-or-one and zero-or-more constructs.

This issue is also present with other constructs.

For example, with something like:

 SCAN "foo" ["bar" "baz"] "bar" "bat" => Foobar

With the older legacy (glitchy) lookahead behavior, the above lookahead would accept the following input:

foo bar bat

It matches the foo and then hits the optional ["foo" "baz"] but since the following two tokens are "bar" "bat", not "bar" "baz", it skips the optional part and then happily consumes the bar bat.

BUT... this is not correct. Once it goes into the optional construct ["foo" "baz"], the (implicit) one-token nested lookahead is that we match the inner expansion "foo" "baz" if the next token is "foo", which it is. So, again, we get committed to this branch and we fully expect the next token to be a "baz"!

If we have LEGACY_GLITCHY_LOOKAHEAD turned off, then to get the prior behavior, we would need to write:

SCAN "foo" ["bar" "baz"=>||] "bar" bat"

Well, properly understood, this is actually not a separate case. After all, the construct [X] is really the same thing as: (X | {}), a choice between the expansion in question and simply consuming/scanning nothing. So, if we rewrote the above as:

SCAN "foo" ("bar" "baz" | {}) "bar" "bat"

It should also be clear that the same issue occurs with a repeated construct, so if we had:

SCAN "foo" ("bar" "baz")* "bar" "bat"

there is the same legacy glitchy lookahead issue. With the older legacy behavior, we would accept:

  foo bar baz bar bat

However, again, we would have to rewrite our lookahead as:

 SCAN "foo" ("bar" "baz" =>||)* "bar" "bat"

to get the older behavior. And, for the sake of completeness, I should point out that a one-or-more construct, like (X)+ is just a convenient way of writing X (X)* so we can see that this is not really a separate case.

"What do I need to do?" (And why?)

For now, the LEGACY_GLITCHY_LOOKAHEAD setting is on by default, so you don't absolutely have to do anything. I anticipate changing this to being off by default once we move to the Congo branding. So, at that point, if you want to use the most up-to-date version of the tool, the minimal action you need to take would be to put

LEGACY_GLITCHY_LOOKAHEAD=true;

at the top of your existing grammars. Well, of course, you might even discover that your grammar(s) continue to work even without that because maybe you didn't have any constructs that depended on the older, glitchy behavior.

All that said, the recommended current course of action is, to get a jump on all this (potential) problem, just put:

LEGACY_GLITCHY_LOOKAHEAD=false;

at the top of your existing grammars and see if this breaks something. If it does, then you really ought to narrow in on where this happens and fix it. Most likely, it is not a very big edit. It would be something like what you see above, explicitly putting an up-to-here in the expansion that was being used this way in a lookahead.

The reason to address this issue is that, at least for complex grammars, you should be better off if we do have the basic contract that a lookahead succeeding means that the parsing will succeed. It will be easier to reason about complex nested constructs that you may have to write. This kind of longstanding glitch, where nested lookahead was never really working correctly, tended to make the writing of JavaCC grammars (at least beyond a certain level of complexity) into a kind of black art. So, at the risk of incurring a bit of transitional pain, we're (at long last!) getting this squared away.