Fault tolerant parsing, is it a pipe dream? (An answer to Angelo Zerr)

This is a response to a private email from Angelo Zerr, but finally, I thought to put my answer in public so that it could attract more general discussion.

Angelo wrote me saying:

I don't know if I have explained very well my idea with tolerant, incremental parser and LSP generation project. But let for the future this idea.

Well, I'm pretty sure we're thinking along similar lines, Angelo. (Though do correct me if I'm wrong.)

I'll share some of my thinking on this. I'll get a bit concrete with some actual code. Consider the following code snippet:

     
     x = 1;
     foo(bar);

The default Java parser included with JavaCC 21 produces an AST (or partial AST) that, when dumped in an indented text form, looks like this:

           StatementExpression
              PrimaryExpression
                x
              =
              PrimaryExpression
                1
            ;
            StatementExpression
              PrimaryExpression
                foo
                InvocationArgs
                  (
                  PrimaryExpression
                     bar
                  )
            ;

Now, suppose we introduce a simple an obvious error in this code. We remove the semicolon at the end of the first line. Well, that would be okay in Python, but not in Java or any language with a C-like syntax. The statement has to have a semicolon at the end.

So, in this venerable parser generator space, how does a parser generated by JavaCC (or Yacc or any such) deal with this kind of input?

Well, it just gives up. Right?

It exits with error in C or throws an exception in Java or whatever. This is not very useful for an interactive tool like a source code editor or IDE. The other strange aspect of this is that the error is completely recoverable. I mean, a human just scanning over such code where one line like that is missing the semicolon at the end does not have to stop! He can keep scanning! But the machine generated parser apparently cannot. This is the tradition in this space, as far as I can see. There is no serious thought put into dealing with erroneous input.

Now, it seems to me that there is a fairly obvious solution. Or the beginnings of a solution anyway. Clearly, when you hit a point where you obviously need the semicolon, you insert one! An imaginary or virtual semicolon. And you continue parsing and you build your tree!

      StatementExpression
         PrimaryExpression
            x
         = 
         PrimaryExpression 
            1 
         ; #inserted imaginary semicolon

So, you see, somehow, any tool that walks the tree can happen on a node and see that it was inserted. It can even feed the source into a compiler as long as it puts a "real" semicolon into the stream where there is the imaginary one, right?

Of course, the whole question of how to handle certain errors just doesn't seem to be part of this parsing literature. If you consider something like:

      public static public final int = 9;

Well, this is not valid code in Java, because you are only supposed to specify "public" once. But then, if you specify your grammar in the most direct, simple-minded way and you have something like:

     ("public"|"static"|"private"|"final")* ...

Well, then the above snippet is okay! Probably you would write it this way and then have a code action that checks for things like this. Also, you obviously can't write:

     public static private int x = 9;

but the above grammar would accept that as well. It seems that the best solution for most use cases would be just to build up the AST and then in a post-parsing sanity check, just walk over the whole tree and check for things like this. Certainly, that would be the most practical thing for a source code editor. The funny thing here is that I just noticed that the Java parser included in JavaCC 21 actually parses the above sorts of things without complaint! So the AST it generates looks like this:

     ClassOrInterfaceBodyDeclaration
        public
        static
        private
        FieldDeclaration
          int
          VariableDeclarator
            x
            =
            PrimaryExpression
              9
          ;

Actually, I was vaguely aware that my parser accepted this kind of input but, for my purposes, it didn't really matter! Of course, a post-parse tree walking step could mark the ClassOrInterfaceBodyDeclaration node as invalid. But this invalid code doesn't prevent the parsing from continuing.

Well, I guess the real point here is that I have been thinking about all these things and you asked me specifically about the FreeMarker project, but really it seems to me that this has to be addressed at the more fundamental level in JavaCC itself, because you just constantly have this situation where some delimiter is missing and it's quite obvious what the problem is, like somebody writes

if (foo(x) == bar(x) {
  ...
}

and he's missing the closing parenthesis for the if condition. So you insert the missing element, an imaginary closing parenthesis to balance out the expression, but again, any tool walking the AST can see that the node is invalid, was inserted. And then, conversely, on occasion, you'd have to remove an extraneous one and the node is present in the AST but is ignored for parsing purposes.

if (foo(x) == bar(x))) {

}

I'm thinking about how this gets expressed in a JavaCC grammar, like formally stating that if there is no semicolon at the end of a statement, you insert an imaginary one, things like that. But, I am thinking about how to get these things into the parser generator so that any project that uses JavaCC to generate a parser can just have a lot of this stuff automatically working.

So, let's see... you went on to write:

IMHO I think if you want that user will adopt your FreeMarker project, you should give the capability to integrate it easily inside an IDE (tolerant, incremental, debugging support).

Well, the above examples I gave were in Java (or could be C# or C++ or something like that actually...) but this kind of problem of dealing with invalid input, in particular cases that should be quite straightforward, like unbalanced parenthesis, are not specific to any programming language. So, you ask me about FreeMarker, but I think that where that comes in is that FreeMarker can end up being a really big functional showcase of these sorts of features, but it should be generally useful outside the FreeMarker project.

You know, generally speaking, when I thought to get involved with JavaCC (and ended up having to fork off a new project that I called FreeCC) I had a strong gut feeling that these sorts of parser generator tools can be a lot more generally useful than they really are.

I've given some thought to this and I think that you can understand a lot of the problem by just looking back at the origin of these sorts of tools, like YACC and so forth. This was a stage of computing history where there was actually quite little interactivity. The way people wrote programs in languages like FORTRAN and COBOL is that they had these stacks of punch cards that they fed into a punch card reader. That was called "batch processing". So if your "batch" didn't run properly, then you had to go back and figure out what was wrong, the missing semicolon or whatever. So this would explain why tools from that era don't really address the most basic problems that somebody working in the current day and age runs into, like providing useful interactive feedback to somebody using an IDE.

Well, there are something like 7 billion people in the world, so we can't be the only people thinking about this, but still, I find the whole thing quite interesting and it's interesting to see that in the parser generation space, the dominant tools still have this very 1970's mind-set that their job is simply to parse valid input and nothing more, so if the input is incorrect, well, that's your problem! They don't even really think about that very much. Well, my intention is to come at this from a more practical angle and see what can be done with a stone-age tool like JavaCC to make it more practically useful to more people.

Hey, maybe I can get some people interested in this!