Revisiting LOOKAHEAD in JavaCC, Attacking the Dmitry Dmitriyevich Problem

As a result of some private correspondence with one of the PMD developers (PMD makes extensive use of JavaCC) I started thinking about some issues that I had in the back of my mind to look into at a later point.

Having recently ripped out the code in JavaCC 21 that supported putting LOOKAHEAD specifications at non-choice points, I was informed that this was a change (possibly not the only one) that prevented PMD from switching over to JavaCC 21. To tell the truth, I was rather surprised by this. I had just assumed that practically no JavaCC grammars in the wild really used this. Besides, the usage pattern gave off a funny smell. Basically, you define a phony grammatical production (one in which the parser never enters) in order to use it in LOOKAHEADs.

The need to do things like this is suggestive of some design failings. Well, truth told, I don't think that there is any absolute need to do this as things stand, even when using legacy JavaCC. However, all of this got me thinking about how LOOKAHEAD works in JavaCC. As far as I can tell, despite this being one of the most basic things in the tool, it has never been revisited.

Okay, starting from first principles, JavaCC, at least by default, assumes a LOOKAHEAD of 1 token. Whenever there is a fork in the road and it must decide which path to take, the baseline assumption is that looking at the next token in the stream is enough information to make this decision.

Well, sometimes it is and sometimes it isn't. But if it isn't, you're supposed to specify LOOKAHEAD. That makes sense as far as that goes, but actually, I daresay that all of this has been one of the most annoying aspects of using the tool. For one thing, it requires us to insert explicit LOOKAHEAD specifications in places where the tool really ought to just infer them. For example, if you write something like:

  
  (
    (<IDENTIFIER> "{")
    |
    (<IDENTIFIER> "(")
  )

the thing bitches and moans until you put LOOKAHEAD(2) up top. In fact, the tool can perfectly well deduce that it needs to scan ahead 2 tokens to resolve the choice. Finally, as far as I can see, there is no real reason that it should not just deduce this and just quietly put in the code to look ahead the necessary 2 tokens. As best I understand it, this is really just some sort of longstanding shibboleth.

Now, when it comes to having to overspecify things, as is frequently the case with LOOKAHEAD in JavaCC, it occurs to me that this is an example of what I have called in the past the Dmitri Dmitriyevich problem, a.k.a. the Ivan Ivanovich or Vladimir Vladimirovich problem. (This is just my own private terminology, but hey, maybe it could become popular out there!)

The origin of the term is one of my (possibly dubious) attempts at a witticism some years back, when I commented to a collaborator that, increasingly, I found that reading Java code was like reading one of those huge Russian novels in which the characters all have these sorts of long, seemingly repetitive names. In particular, constructs like:

   Map<String, String> lookup = new HashMap<String, String>();

Well, at some point, one of the many geniuses employed at Sun Microsystems figured out that the second specification of the type parameters in the line above was superfluous and the compiler could simply deduce it. So now you can just write:

   Map<String, String> lookup = new HashMap<>();

Well, that took them 7 years (from JDK 1.5 in 2004 to JDK 1.7 in 2011) but the genius boys figured it out. So much for that particular Dmitry Dmitriyevich problem. R.I.P. (By the way, I actually do know that in these Russian names, the second name is a patronymic, the name of the father, so it is not really repetitious and superfluous. It just looks that way when the son is named after the father. In the line of Java code above, by contrast, the repetition of the type parameters really is redundant and pointless!)

But getting back to JavaCC specifically, we frequently encounter things the following Dmitri Dmitriyevich construct:

   LOOKAHEAD (Foo() [Bar()] Baz())
   Foo()[Bar()]Baz()

It so happens that, in a very high percentage of the cases where you write a LOOKAHEAD with some fairly complex expansion with non-terminals inside, the very same expansion is simply repeated afterwards. Or even not such complex things:

    LOOKAHEAD(Foo()) Foo()

Now, granted, the expansion after the lookahead is not always identical. For example, it would make perfect sense to write:

   LOOKAHEAD ("try" "{")
   "try" CodeBlock()...

In the above case, we only need to look ahead 2 tokens to decide whether to enter the expansion. There is no need to do a syntactic lookahead all the way to the end of the last catch or finally block, which would be rather expensive and pointless. However, the case where the expansion inside the LOOKAHEAD is identical to the one that follows it is common enough that it ought to receive some special consideration. So, in the latest version of JavaCC 21, if the expansion following the LOOKAHEAD is identical, you don't need to specify it, so you can write:

   
LOOKAHEAD()
   Foo() Bar() Baz()

and it is the same as if you had written:

   
LOOKAHEAD(Foo() Bar() Baz())
   Foo() Bar() Baz()

 

In fact, it occurred to me that LOOKAHEAD() was still too verbose, so I introduced a new keyword (though this could be open to some discussion). I chose "SCAN" so you can also just write:

   SCAN Foo() Bar() Baz()

I'm interested in some feedback on that. It also has occurred to me since then that even just using a one-character operator, like maybe ^ could be even better. I don't really know whether that is better than using an English language keyword like "scan". I thought to try to put to vote whether people prefer to write:

   LOOKAHEAD(Foo() Bar() Baz()) Foo() Bar() Baz()

or:

   SCAN Foo() Bar() Baz()

or

   ˆFoo() Bar() Baz()

Somehow, I don't think very many people would vote for the first one! However, it will always remain an option!

Notable Replies

  1. Hi. For me:

    • ok for lookahead() foo() bar() baz() (empy args -> full semantic la)
    • against for lookahead foo() bar() baz() : no args could be reserved for other future specific cases
    • against scan foo() bar() baz(): no interest adding a new keyword
    • agains ^ foo() bar() baz(): no interest in a cryptic character; keep it for other future uses
    • ok for lookahead(~…)
  2. Hi
    Why LOOKAHEAD() and not LOOKAHEAD?
    To keep close to the current syntax : The general structure of a LOOKAHEAD specification is: LOOKAHEAD(amount, expansion, { boolean_expression }):

    • it makes it easier for IDEs (code completion, formatting) to handle one syntax (always with parenthesis) than two (with or without),
    • is it easier for the beginner to learn only one syntax.

    SPECIAL_TOKENS : we need them in IDEs (for reformating, syntax highligthing). A good improvement would be to chain them differently to the other tokens: in this order:

    • specials after normals and on the same line as them : chain them in a “end” list (end of line comments or block comments, usually related to the current line)
    • specials before normals and on the same line : chain them in a “beg” list (block comments at the beginning of a line, usually part of the line)
    • specials before normals on the previous line : chain them in a “prev” list (javadoc,block or line comments, usually related to the next line)
    • specials on lines with at least one blank line before normals or at eof : chain them in an “orphan” list (usually more global comments)

    SKIP and SPECIAL_TOKENS: a very good improvement would be to transparently generate the tokens for them, to be able to generate the node tokens (for JJTree or JTB), to have them in the ASTs; currently, if we want to manage them (again we need them for formating / code completion in IDEs), we need either to declare them as normal tokens and pollute the grammar (with optional productions or lexical states), or write specific code to get the node tokens, the tokens and then the specials lists or even use the tokenmanager APIs to get the skipped whitespaces.

  3. As examples:

    See lines 674+ in https://sourceforge.net/p/eclipse-javacc/code-git/ci/master/tree/sf.eclipse.javacc/src-plugin/sf/eclipse/javacc/scanners/JavaCCCodeColorRule.java

    See lines 917+ in https://sourceforge.net/p/eclipse-javacc/code-git/ci/master/tree/sf.eclipse.javacc/src-plugin/sf/eclipse/javacc/editors/CompletionProcessor.java

    Lookahead() is already handled. Lookahead not. Small extra work when so much needs to be done elsewhere will go in last priority, for me.

    User will always have to think and choose between Lookahead and Lookahead() when editing with code completion… Currently he has only one proposal. No need an extra thinking.

  4. Another thing this got me wondering about is whether there should really be any need to write a Non-terminal with no args as Foo(). Possibly, Foo would be okay. I’m thinking about that now, but I am not going to make any very rash decision. The reason that you have to put the empty parentheses after a method call in Java is that foo() and foo really are necessarily different.

    But that is another question that I am not going to decide on right now. The basic thing here is that if you want to convince me of something, you have to make an argument based on what is good for the end user, not what is good for you! This has to do with the overall philosophical approach of any project that I am in control of.

  5. You know, as a strange addendum to this, I looked at the JavaCC (21) grammar and I see that this ability to write Foo Bar Baz instead of Foo() Bar() Baz() is already there!

    See: https://github.com/javacc21/javacc21/blob/master/src/main/grammars/JavaCC.javacc#L1645

    I must have rewritten that grammatical production to make the () optional. (Assuming there are no arguments, of course!)

    I was unsure how old this was and I checked back in the last FreeCC grammar (FreeCC 0.9.3 released early 2009) and the parentheses are optional there too! I must have thought to myself that there was no need to make anybody write the paretheses if there were no arguments, but then forgot about it later! And now I talk about this as a (possible) feature except that is how it currently works!

    I have no idea if this is a case of creeping senility.

    (N.B. All the above is orthogonal to the question of whether it is desirable to allow people to omit the parentheses or not. It’s just that I checked, and to my suprise, JavaCC21 allows you to omit the parentheses since back in the old FreeCC days!)

    Except I never documented this (probably other things were not documented as well) and the only way anybody could have discovered this is by accident! (Or by actually reading the JavaCC21/FreeCC grammar!)

    I just looked at the legacy JavaCC and the parentheses are mandatory. They were mandatory in JavaCC 4.1, which is the codebase I forked from, I think. More or less… I notice also that the latest JavaCC grammar in the legacy project allows type parameters (optionally) in a non-terminal, so it can be:

    Foo<type1, type2>()

    See: https://github.com/javacc/javacc/blob/master/src/main/javacc/JavaCC.jj#L1304

    But I have no idea what the purpose of this is. I guess it must have something to do with supporting C++ or C#. In Java, a grammatical production translates to a Java method invocation and there is no place to put type parameters in a method invocation! Do you know offhand what the purpose of this is?

Continue the discussion at parsers.org

17 more replies

Participants