Moving Towards a Maximally Correct Reference Java Grammar

Now that the ability to generate fault-tolerant parsers is coming along so well, I have been thinking about what to do with the Java grammar included in CongoCC. I decided that the best thing to do was to do the incremental work to make it maximally correct. Ideally, it will serve as a reference implementation that other projects can easily incorporate for their own use.

I think that the current version is pretty much an absolutely faithful implementation of the latest Java Language Specification for Java up to JDK 20. You don't have to take my word for it. You can try it for yourself:

  git clone https://github.com/congo-cc/congo-parser-generator.git congo
  cd congo/examples/java
  ant test

At this point, you can run:

 java JParse

which is the test harness for the latest version of this Java grammar. You can give it a single file to show you the AST, or you can give it multiple files or a whole directory to parse. If you run it with the -t parameter, it runs in fault-tolerant mode. That is particularly interesting because it shows you the AST that is built for various erroneous constructs -- assuming you feed it some invalid Java source file. It's pretty fast. If I unzip the src.zip from the JDK somewhere, I can run it over the whole thing like so:

 java JParse <java src root>

and on reasonably current hardware, it parses the whole thing, 17,619 Java source files, in under a minute. You can try it on your own machine and tell me the result. I reckon that's fast enough. Do you?

So, I really do encourage people to play with this, both in fault-tolerant and fault-intolerant mode and report back to me if they notice anything that seems amiss.

That would be useful but also it's kind of a fun little toy to play with. I think we should all try to have a bit of fun before they finally make it illegal. The history of that Java grammar file is also kind of fun, so I lay it out below, for those who are interested.

That's a Lady File with a History!

I reckon that a lot of banal, everyday objects have some fascinating history that we know little about. I was thinking about some curious object that you could have on your coffee table as a conversation piece. "Well, this curious object has a fascinating history you know.... not many people know that...." (Yeah, okay, and not many people care either.)

A while back, I put together (to the best of my knowledge) a history of the overall JavaCC project. It now occurs to me that the Java grammar included with CongoCC has its own fascinating history. That file was not written by me from scratch. It is a forward evolution (not the only one, by the way, one of several) of the Java grammar that was included in the legacy JavaCC distro back in the stone ages. (1997 or so...) I suppose it was originally just a grammar of the very early Java spec, Java 1.1, I believe, but at some point in the following years (I mean, something like between 1997 and 2006 or so) some work was actually done on it to keep it up to date with the evolution that Java was undergoing. Come to think of it, that may be one of the few useful things that the JavaCC project actually did in the years following it being open-sourced in mid-2003. After 2006 (approximately) even that activity ceased. So that Java grammar became increasingly out of date.

Now, here is a point to consider: one of the most infuriating of the many inanities of the legacy JavaCC project is that, in over 2 decades, they never implemented an INCLUDE directive so the codebase always violated the key principle of DRY (do not repeat yourself) in an utterly grotesque and flagrant manner. (Their own codebase but also that of anybody who did any work of significant scale using JavaCC...) You see, JavaCC's grammar is a superset of Java itself. And meanwhile, they had a separate tool called JJTree with its own separate grammar, except the grammar of JJTree was (and is) in turn a (fairly trivial) superset of the JavaCC grammar.

Given this, it would make perfect sense for the JavaCC project to only maintain a single canonical Java grammar that they could re-use. In that imaginary (sane) world:

they would commit to keeping the Java grammar current so that people could use it in their own projects
they themselves could just re-use it internally

Well, I suppose it would just be too easy and make too much sense to maintain a single reference Java grammar that could be re-used -- both internally and externally. So, no, they had two separate grammars for JavaCC and JJTree and each of those included within it all of the syntactic constructs of the Java grammar itself. And they had this separate sample Java grammar. All completely separate files, since there is no INCLUDE directive.

Now, here is where the history gets a bit more involved. Obviously, JavaCC is not the only project that needs a Java grammar. Surely lots of people out there need one for their own use. So, what happened is that various projects (actually, I know of two, but I think there are probably more that I don't know about) created their own fork of this sample Java grammar and separately maintained it. I am referring to two fairly well known OSS projects, JavaParser and PMD. The latter one is something of much bigger scope than simply maintaining a Java parser, but the former really is a separate project that exists because somebody by the name of Júlio Vilmar Gesser (with whom I am not acquainted at all) took the sample Java grammar from JavaCC and started a separate project based on that. (The raison d'être of that project really seems to be little more than just maintaining a Java parser.)

Now, to be perfectly clear, there is nothing wrong with somebody doing that. In a sense, that is what open source is about. Well, except that, in this case, it does give one an uneasy feeling. (Or, it should, anyway...) You see, it really does not make sense that somebody has to fork off this sample Java grammar and start a new project. It really seems to me that the most minimal baseline of activity of a JavaCC sort of project should be to maintain their own Java grammar and keep it up-to-date with the current spec. Well, grammars for other important programming languages would be nice too, but particularly Java, since they need it internally anyway, and besides, providing a tool for Java developers was, after all, the original focus of the project, no?

Well, regardless, the founder of the aforementioned JavaParser project decided to separately maintain this piece of the JavaCC project (what had originated as that, anyway) and, okay, so be it. But then, I would also make the point, as regards the other project I mentioned, PMD, if Señor Júlio Vilmar Gesser and whatever community he could attract had already taken on themselves the task of maintaining this Java grammar, why should the PMD people have to maintain their own separate version of this very same file? (At least, originally, it was the same file.)

You see, this brings us to a more general point... Somehow, and I'm not even sure exactly why or how, this whole Java parser generator space seems to be stuck in some kind of time vortex. Well, the JavaCC project is itself a horrid case. Everything about it gives off this musty cobwebs-in-the-attic sort of smell. But it's not just JavaCC. The whole idea that all of these separate projects need to maintain their own separate Java grammar, which is probably about the same on the 98% level, it seems like something from a much earlier stage of computing. In the old days (I'm thinking late 20th century mostly) developers would frequently have their own implementation of Hashtable or a growable array (like java.util.ArrayList) that they use and maintain separately. Just part of their toolkit, like Clint Eastwood's six-shooter, say. But nowadays there is the understanding that any modern language would have a standard class library and you just use that. So a Java developer just uses java.util.HashMap or java.util.ArrayList. This basic idea does not originate with Java, of course, but the understanding of this idea becoming widespread probably does largely coincide in time with the appearance of Java and its rise as the most popular OOP language.

Well, these are young guys for the most part, so they grew up with these concepts presumably, so it is a bit strange that it does not occur to any of these people that there is anything wrong with this state of affairs.

My own position, on the contrary, is that anybody who uses CongoCC and also finds himself in need of a Java grammar, should really just try to use the one that is included. I do not anticipate any point in the future in which CongoCC is widely used and there are a lot of people maintaining separate Java grammars, rather than simply re-using the included one. It doesn't really make any more sense than a situation in which Java itself is widely used (which it is) and everybody is maintaining their own hash table implementation (which they aren't!)

Code Appreciation Time

I remember back in grade school or Junior High maybe, there was a music teacher who would just set some time for "music appreciation". He'd just bring in some records and play some Mozart or Beethoven or Tchaikovsky and try to get the class to appreciate it. For aspiring programmers, some code appreciation time might be a good idea too, no? I hope the reader will humor me and just compare the following files side by side:

This one from the aforementioned JavaParser project

The same file (basically...) from PMD

and now:

The Java grammar used in CongoCC internally

Actually, in very recent correspondence with Brian Goetz I asked him to compare the files and give his impression. Okay, I know that I'm an insufferable show-off, but Mr. Goetz did humor me and took a look and maybe you will too, Dear Reader. (Actually, to be precise, I only pointed Brian to the latter two files. For some reason, I did not point him to the first of the three above.) I asked him to consider how elegantly and economically certain things are expressed in my version. Well, Mr. Goetz agreed. He wrote back:

What a difference! The lack of separation of concerns just flies off the page on the first one.

Of course, I was quite flattered, but actually I think that Brian was largely reacting to the aesthetics. The newer streamlined syntax just looks so much cleaner since it removes so much of that legacy visual clutter. And I guess that's what I was mostly pointing out to him anyway. But, on further reflection, I suspect that he still may not quite grasp just how much more economically CongoCC expresses things -- or really, I mean to say, the full extent of it. You see, for one thing, the PMD version uses the legacy JJTree tool, which does generate all these ASTXXX node classes, but since there is no INJECT instruction, they have to post-edit and subsequently maintain all those files separately.

So they have this separate package here which contains something like 12,000 lines of code that, even though it is largely generated boilerplate, is checked into the repository and separately maintained by hand. So, if they want their ASTPackageDeclaration node to have a little getName() convenience method, they have to add it in the appropriate file, here specifically.

The CongoCC version handles these sorts of things via INJECT, and you can see that here. Well, that's so short that I'll just insert the relevant code here:

 PackageDeclaration : (Annotation)* "package" =>|| Name ";" ;

 INJECT PackageDeclaration :
 {
    public String getPackageName() {
        Node nameNode = getChild(getChildCount()-2);
        return nameNode.toString();
    }
 }

CongoCC simply regenerates the PackageDeclaration.java file each time and it has the method injected into it. This is just the typical usage pattern for CongoCC. Suppose you're working on that piece of the grammar, where the PackageDeclaration is defined. You think: "I need the generated Node object to have this getXXX method" and you just add the code injection right below (or right above, if you prefer) where the construct is defined. So the above generates something like:

package org.parsers.java.ast;

import org.parsers.java.*;
import static org.parsers.java.JavaConstants.TokenType.*;

@SuppressWarnings("unused")
public class PackageDeclaration extends BaseNode {
    public String getPackageName() {
        Node nameNode= getChild(getChildCount()-2);
        return nameNode.toString();
    }
}

But the truth is that it hardly matters what it generates, because there is rarely much more need to ever look at it than there is to eyeball a .class file. (Though if you do look at it, it's more readable than a .class file, of course.) So, the PMD project checks in and maintains by hand thousands of lines of code, mostly quite trivial code or just generated boilerplate, but the above file generated by CongoCC is not even checked into the repository, since it is just re-generated each time. And again (since it bears repeating) there is rarely any need to open the file or look at it. That, and another 100+ generated Token/Node sorts of files are just out of sight, out of mind. You do your work on the grammar file and these other files are just generated from that and you don't need to post-edit them, or even look at them really. (There, I just repeated myself again!)
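Why getChildCount()-2 in that injected method? Going by the production above, the children of a PackageDeclaration are zero or more annotations, the package keyword, the Name, and the closing semicolon, so the Name is always second from the end, however many annotations come first. The stand-in class below is purely hypothetical (plain strings instead of CongoCC's generated Node types) and only illustrates that indexing:

```java
import java.util.List;

// Hypothetical stand-in for a generated node: children are plain strings here,
// whereas the real PackageDeclaration holds Node objects.
class PackageDeclSketch {
    private final List<String> children;

    PackageDeclSketch(List<String> children) {
        this.children = children;
    }

    String getChild(int index) {
        return children.get(index);
    }

    int getChildCount() {
        return children.size();
    }

    // Same indexing as the injected method: the name sits just before the ";"
    String getPackageName() {
        return getChild(getChildCount() - 2);
    }

    public static void main(String[] args) {
        PackageDeclSketch plain =
            new PackageDeclSketch(List.of("package", "org.parsers.java", ";"));
        PackageDeclSketch annotated =
            new PackageDeclSketch(List.of("@Deprecated", "package", "org.parsers.java", ";"));
        System.out.println(plain.getPackageName());
        System.out.println(annotated.getPackageName());
    }
}
```

Counting from the end rather than from the start is what keeps the method indifferent to the number of leading annotations.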

So, the difference is really quite dramatic. However, amazingly, the comparison with the JavaParser project is even worse than with PMD. Since they don't use JJTree (which is understandable, I grant) they have to manually insert their own tree-building actions in their grammar file. So forget about their Java grammar ever being re-usable by other projects because it can only be used in conjunction with their own Node/Visitor traversal API that is for their own internal use and that is expressed in something like 40,000 lines of code. Well, to be clear, and also to be maximally fair about things, other projects can re-use all this and they do, but what I mean is that the grammar file they maintain is not really separately usable. If you use their work, you end up using this whole very big heavy solution (or so it seems to me). JavaParser looks like a very bloated, over-engineered solution for a typical use case where you just want to read in some Java source files, build a tree, and extract whatever information. But I grant that that could be a question of personal taste...

Well, I've said more about all this than I ever intended to. Whatever one can say about the people behind these respective projects, they are not exactly lazy. I certainly would never have the energy to maintain tens of thousands of lines of extra code to represent a Java AST, which is why, in my own version, I express the same things (pretty much the same, it's not clear what extra things their code really does) in about a thousand lines and the rest is just generated. I'm getting old, I guess, and don't have the energy to deal with a huge codebase like that. So, if somebody offered me a tool that provides this magnitude of time savings, I would sit up and pay some attention...

Now, I suppose you might be thinking that the reason these people have shown no interest in my work is that they simply don't know about it. But you would be wrong. I've been in touch with individuals from both groups and there is simply no way of conveying to them the idea that they could make their lives a lot simpler by upgrading from the legacy JavaCC to CongoCC. Well, I think that a couple of them know that I have a point, but... well... there are people, particularly younger ones, who really do not want to be told anything. Of course, most people don't like to be told anything, but these people really don't. I reckon they'd rather go spend a few hours at the dentist than have me tell them anything.

So there you go...

Close enough for government work?

For some reason, I thought it worthwhile to explain some of the pedigree of that Java grammar file. However, practically speaking, the more important point is really just how sloppy it is. It emerges from the original round of work on JavaCC, which incorporates some quite good ideas, but was implemented in a very sloppy, cowboy-ish kind of way. It just was. And that sample Java grammar is very much in that vein. At various points in the grammar, there are comments to the effect that: "Well, this isn't really correct but to do it the correct way would entail some performance loss so..."

The strange thing about that, though, is that surely, a first-pass "proof of concept" Java grammar, included in a very early version of the JavaCC tool, should be maximally correct and any cutting of corners to make it run a bit faster should occur in a separate stage. Or really, just separately. It really seems to me that the JavaCC project should provide a reference that is as close to the spec as humanly possible (or practical) and if anybody needs something more optimized for speed or space or whatever, that is their problem. Or, at the very least, it is something that should be addressed (if it proves necessary) later.

So, as things stand, that grammar and all the versions of it -- the one maintained by the JavaParser people or the PMD people or, up until recently, the one maintained by me -- were really quite outrageously loose. I never addressed this before, because all of my previous iterations of work on the thing were just to make sure it could parse all the main constructs up to JDK 15. Or in other words, I was focused on accepting that which was valid, not on rejecting that which was invalid. So, just in the last week or so, I turned to that problem and now, at least as far as I can see, the included Java grammar really does correctly implement the language specification. So, here are some of the various issues that I addressed:

Modifiers

That legacy grammar did not make the slightest attempt to handle the question of which modifiers can be used in which contexts. You could put any modifier keyword practically anywhere, so you could declare a local variable as public or private when, in fact, the only valid modifier in front of a local variable declaration is final. You could declare a variable (or a class or interface) to be synchronized. (Only a method can be synchronized. Though the keyword synchronized can also be used in a synchronized block.)
You could even repeat modifiers, writing things like public static public... or have sets of modifiers that are incompatible, like private public... or abstract final....
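To make those three kinds of mistake concrete (a modifier in the wrong context, a repeated modifier, an incompatible pair), here is a minimal sketch of the rules as a plain Java check. It is not how the CongoCC grammar encodes them. The class name, the Context enum and the tables are all made up for illustration, and the tables are deliberately incomplete:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only.
class ModifierCheck {
    enum Context { LOCAL_VARIABLE, FIELD, METHOD }

    // Keyword modifiers permitted in each context (annotations and newer
    // keywords such as sealed or default are left out of this sketch).
    private static final Map<Context, Set<String>> ALLOWED = Map.of(
        Context.LOCAL_VARIABLE, Set.of("final"),
        Context.FIELD, Set.of("public", "protected", "private", "static",
                              "final", "transient", "volatile"),
        Context.METHOD, Set.of("public", "protected", "private", "abstract", "static",
                               "final", "synchronized", "native", "strictfp"));

    // Groups from which at most one modifier may appear.
    private static final List<Set<String>> EXCLUSIVE = List.of(
        Set.of("public", "protected", "private"),
        Set.of("abstract", "final"));

    /** Returns one message per rule violation; an empty list means the modifiers are fine. */
    static List<String> problems(Context context, List<String> modifiers) {
        List<String> problems = new ArrayList<>();
        Set<String> seen = new LinkedHashSet<>();
        for (String modifier : modifiers) {
            if (!ALLOWED.get(context).contains(modifier)) {
                problems.add(modifier + " is not allowed on a " + context);
            }
            if (!seen.add(modifier)) {
                problems.add("repeated modifier: " + modifier);
            }
        }
        for (Set<String> group : EXCLUSIVE) {
            List<String> clash = new ArrayList<>();
            for (String modifier : seen) {
                if (group.contains(modifier)) clash.add(modifier);
            }
            if (clash.size() > 1) {
                problems.add("incompatible modifiers: " + clash);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(problems(Context.LOCAL_VARIABLE, List.of("public")));
        System.out.println(problems(Context.FIELD, List.of("synchronized")));
        System.out.println(problems(Context.METHOD, List.of("public", "static", "public")));
        System.out.println(problems(Context.METHOD, List.of("abstract", "final")));
    }
}
```

A post-parse check like this is the "loose parser plus validator" approach. The point of the reworked grammar is that the parser itself refuses such input in the first place.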

Assignments

Only certain kinds of expressions can be on the left-hand-side of an assignment. You can't write: this=7; or foo()=bar(); since the expression on the left-hand-side of an assignment must be a variable that you can assign to.

This also applies to prefix and postfix increment/decrement expressions like x++. The x has to be a variable you can assign to. That old grammar was written in a way that accepted things like (x+7)++ and so on.

Only certain expressions can stand on their own as statements.

By the way, I have to admit that I only know about all this stuff in such detail because I finally broke down and consulted the ~~bible~~ JLS, the Java Language Specification. An expression can stand alone as a statement if it is one of the following:

a method call
an assignment
a prefix or postfix increment or decrement (like x++ or --x)
an object instantiation

Thus,

 2+2;

is not a valid statement. Nor is:

 x;

There are some strange wrinkles in all of this. For example, you can write:

 (n)++;

or:

 (x) = y;

The compiler would interpret the above as being the same as n++; and x=y; respectively. That kinda makes sense, but if the compiler did not allow you to write these things, that would be okay too. (For me, anyway.) I was actually surprised that these were permissible.

Strangely, though, you cannot write:

 (x());

even though, by a similar logic, this should be interpreted as meaning the same thing as just x();. It very much seems to me that if you permit the preceding statements with superfluous parentheses, you ought to accept this last one as well. Not because it's useful or anything. It's just a question of consistency.

Of course, I suppose it's not written anywhere that the designers of a programming language have to be consistent. Life doesn't have to be fair either even if we would like that also. So there you go...

Anyway, the Java parser in legacy JavaCC accepts all of the above constructs with no complaint. Now, granted, arguably, it could make sense to do your parsing in a very loose manner and then do a post-parse tree walk that catches these problems and reports them to the user. (I suppose that's what JavaParser does. I notice they have some classes like Java8Validator that must walk over the tree and catch these things, but I think that to have a parser that just catches these things straightaway is a good idea.)
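If you want to see what javac itself makes of these statements, you don't need CongoCC at all. The sketch below uses the standard javax.tools compiler API. The StatementProbe class and the little Probe wrapper class it generates are made up for this illustration. Note that javac reports some of these problems while parsing and others in a later phase, and this check does not distinguish between the two:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

// Hypothetical helper, for illustration only.
class StatementProbe {
    /** In-memory source file holding one statement inside a throwaway method. */
    static final class ProbeSource extends SimpleJavaFileObject {
        private final String code;

        ProbeSource(String statement) {
            super(URI.create("string:///Probe.java"), Kind.SOURCE);
            this.code = "class Probe { int n, x, y;"
                      + " int x() { return 0; } int foo() { return 0; } int bar() { return 0; }"
                      + " void m() { " + statement + " } }";
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Returns true if javac compiles the statement without errors. */
    static boolean javacAccepts(String statement) {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null) {
            throw new IllegalStateException("no system compiler available; run this on a JDK");
        }
        try {
            // Class files for accepted statements go to a temp directory (left behind afterwards).
            Path out = Files.createTempDirectory("probe");
            List<String> options = List.of("-d", out.toString(), "-proc:none");
            // The empty listener swallows diagnostics; only success or failure matters here.
            return javac.getTask(null, null, diagnostic -> { }, options, null,
                                 List.of(new ProbeSource(statement))).call();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> statements = List.of("this = 7;", "foo() = bar();", "(x + 7)++;",
                                          "2 + 2;", "x;", "(n)++;", "(x) = y;", "(x());");
        for (String statement : statements) {
            System.out.println(statement + "  accepted by javac: " + javacAccepts(statement));
        }
    }
}
```

The Probe wrapper declares n, x and y as int fields and x(), foo() and bar() as methods, so that any rejection is down to the shape of the statement and not to an undeclared name.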

My thinking about this recently has been that, since Congo-generated parsers will now have the ability to parse in a fault-tolerant mode, if you are in the fault-intolerant mode, the grammar provided for Java (or any other major programming language) should be maximally correct. If you want it to be forgiving of errors, then just turn fault-tolerant parsing on. That's what it's there for! (Yes, sometimes you can have your cake and eat it too.) So, over the last few days, I put in a bit of work to make the built-in Java grammar work according to spec. In fact, as best I can tell, it works exactly according to the spec. I even decided just to implement the spec quite precisely for now, even when it's a tad dubious.

For example, in the JLS (the latest, greatest version, for JDK 15) the specification of ThrowStatement is:

 ThrowStatement:
     'throw' Expression ';'

There is no further specification that the Expression in question cannot be an arithmetic expression, say. So, just by the "letter of the law" the above specification permits:

 throw 2;

or:

 throw n->n+7;

Clearly there are a bunch of things that cannot be thrown. Granted, in the following:

 int x=2;
 throw x;

it is fairly obvious to the naked eye that this is invalid. However, it is not really syntactically invalid because, based on the bounded analysis of the statement itself, i.e. the second line above, x could be an exception instance. We know it's not because we see the line before it, but look at it this way: both lines stand on their own syntactically. Or, to put it another way, a parser is not expected to do any extra analysis to try to figure out whether x in the above really is an exception that can be thrown. That's the compiler's problem.

As for throw 2;, well, the expression to the right of throw is an integer literal and it can be seen with no extra analysis that the statement is invalid. However, the Java language spec says that a valid ThrowStatement is of the form:

 ThrowStatement:
    'throw' Expression ';'

so throw 2; is "valid" by the letter of the law. It is certainly invalid semantically but again, that's the compiler's problem.
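The split between "syntactically fine" and "semantically wrong" can be observed with javac itself, which exposes its phases through the com.sun.source.util.JavacTask API that ships with the JDK. The following is a sketch under that assumption. ThrowProbe and the generated Probe class are made-up names, and the counts only tell you in which phase javac first complains:

```java
import com.sun.source.util.JavacTask;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.List;
import javax.tools.Diagnostic;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

// Hypothetical helper, for illustration only.
class ThrowProbe {
    /** Returns { errors reported after parsing, errors reported after semantic analysis }. */
    static long[] errorCounts(String statement) {
        String code = "class Probe { void m() { " + statement + " } }";
        JavaFileObject source = new SimpleJavaFileObject(
                URI.create("string:///Probe.java"), JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return code;
            }
        };
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null) {
            throw new IllegalStateException("no system compiler available; run this on a JDK");
        }
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
        JavacTask task = (JavacTask) javac.getTask(
                null, null, diagnostics, List.of("-proc:none"), null, List.of(source));
        try {
            task.parse();                       // syntax only
            long afterParse = errors(diagnostics);
            task.analyze();                     // name resolution and type checking
            return new long[] { afterParse, errors(diagnostics) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static long errors(DiagnosticCollector<JavaFileObject> diagnostics) {
        return diagnostics.getDiagnostics().stream()
                .filter(d -> d.getKind() == Diagnostic.Kind.ERROR)
                .count();
    }

    public static void main(String[] args) {
        for (String s : List.of("throw 2;", "throw n->n+7;", "int x=2; throw x;",
                                "throw new RuntimeException();")) {
            long[] counts = errorCounts(s);
            System.out.println(s + "  parse errors: " + counts[0]
                               + ", errors after analysis: " + counts[1]);
        }
    }
}
```

A statement whose first number is zero got through javac's parser, which is the same letter-of-the-law standard the grammar is being held to here. Anything that only shows up in the second number is, as far as javac is concerned, a problem for the later phases.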

I'm of two minds about this, to tell the truth. I might well patch the grammar later on to disallow things like this that are obviously FUBAR. But, you see, at this stage, I decided to implement the Java language specification exactly -- even in cases where I could do a little better.

And I think that what is now there in the repository is just that: a faithful implementation of the JDK 20 language spec. If the spec is loose at certain points, I'll be equivalently loose, but I'll be exactly as strict as the spec is. That's my current position.

So there you go.

Oh, hold on, except for one remaining detail...

Transcending the Basic Multilingual Plane

When Java was first created, it was quite an advance, because it was (as far as I know, anyway) the first major programming language specified to work with full Unicode. So you could write:

 String país = "Canadá";

or:

  Город город = new Город("Москва");

or even:

  城市 北京 = new 城市("北京");

The full range of unicode characters could be used in the names of variables, methods, types... Also in string literals.

It was a very noble idea to treat non-English speakers as first class citizens. Not at all typical of existing computing culture. Or anglo-saxon culture generally, come to think of it...

So, early Java (and JavaCC which dates back to the early days of Java) automatically supported full unicode, all the 50,000 or so characters in the unicode spec.

Well, the problem is that the current unicode spec has far more characters than that. The characters that can be represented internally in 2 bytes comprise unicode's plane 0, a.k.a. the basic multilingual plane. JavaCC was never updated to handle the supplementary characters, such as plane 1, which is the supplementary multilingual plane or plane 2, which is the supplementary ideographic plane. And, in fact, many of these supplementary characters are considered permissible in Java identifiers.

Now, at the time of this writing, CongoCC has been refactored to handle characters outside of plane 0, which are the ones that require 4 bytes of internal storage. This is not much of a practical issue usually, perhaps, but since this project set the goal of having a Java grammar that is absolutely correct, this bears mentioning.
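To see what goes wrong with a 2-byte view of the world, here is a small sketch using only java.lang.Character and String. It has nothing to do with CongoCC's actual lexer code; the class and method names are made up. It checks a would-be identifier once per code point and once per 16-bit char, using U+20000, an ideograph from plane 2, as the test case:

```java
// Hypothetical helper, for illustration only.
class SupplementaryDemo {
    /** Identifier check done code point by code point (keywords are ignored here). */
    static boolean isIdentifierByCodePoint(String s) {
        if (s.isEmpty()) return false;
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) return false;
        for (int i = Character.charCount(first); i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(codePoint)) return false;
            i += Character.charCount(codePoint);   // 1 char in plane 0, 2 chars otherwise
        }
        return true;
    }

    /** The same check done per 16-bit char, i.e. assuming everything lives in plane 0. */
    static boolean isIdentifierByChar(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) return false;
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String ideograph = new String(Character.toChars(0x20000));
        System.out.println("chars: " + ideograph.length()
                + ", code points: " + ideograph.codePointCount(0, ideograph.length()));
        System.out.println("valid by code point: " + isIdentifierByCodePoint(ideograph));
        System.out.println("valid by char:       " + isIdentifierByChar(ideograph));
    }
}
```

The per-char version sees the two surrogate halves separately, and neither half counts as an identifier character on its own, so a perfectly legal identifier gets rejected.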

See this announcement for more information.
