New Experimental Feature: ATTEMPT/RECOVER

A bit over a month ago, I wrote about my intention of tearing out support for "JAVACODE" productions. A JAVACODE production is really just a Java method that is like a pretend grammatical production. I could find very few JavaCC grammars in the wild that used this.

I asked whether anybody would miss them and nobody answered. This could be because nobody cares or nobody is paying any attention anyway. Regardless, they are gone. RIP.

Another related legacy JavaCC feature (perhaps using the term "feature" loosely) is try-catch. I don't mean the try-catch that is part of the Java language. I mean the JavaCC try-catch where you put a grammatical expansion inside the try block. Like so:

try {
  Foo() Bar() Baz()
}
catch (ParseException e) {
   // Some arbitrary Java code
}

So you have a grammar expansion inside the try block and then you have a catch (one or more catch blocks and maybe a finally block, just like Java) in which you put your Java code to handle the error. Except.... hold on...

What do you do inside the recovery block?

Beats me. Just as I pointed out in my post about getting rid of JAVACODE productions, JavaCC makes no real provision for error recovery. With this sort of try-catch, the general situation is that the exception bubbled up from deep in the bowels of our grammar, I mean some deeply nested sub-expansion, right? So what are we supposed to do in the catch block? Or to frame the question more precisely: What do we do in the catch/finally section that is more useful than just letting the exception bubble up to whatever default handler?

Well, I think the cold hard truth of the matter is that there really is not much to be done in this spot. Ergo, this "feature" is simply not very useful. And that could explain why nobody uses it! I scoured the web trying to find real-world usage examples of this try-catch and came up with nothing. Really nothing, even less than JAVACODE productions. I can't find a single JavaCC grammar out there that does this.

Of course, the feature being about as useful (as a nun's... fill-in-the-blank) is only one explanation for nobody using it. Another possible explanation for nobody using the feature is that people don't even know about it! I tried to think back about whether, in my FreeMarker development days, I even knew that you could put this kind of try/catch in a grammar production. I honestly can't remember whether I even knew that the feature existed. (Another damned senior moment, eh?) I likely knew at some point, but I wouldn't be surprised if I was a heavy user of JavaCC for years before happening on this. After all, the main way that people learn JavaCC is by studying and adapting existing grammars, and if the existing grammars simply never use this...

Well, anyway, the existing try/catch really is not useful for a very simple reason: it doesn't rewind to the state of the parse before the attempted expansion. So I introduced an alternative construct that does do that. And instead of try/catch, it is ATTEMPT/RECOVER. The syntax (and if I get feedback, it could still change) looks like this:

ATTEMPT(Foo() Bar())
RECOVER 
{
   // optional java code block
}
(
    FooBar() Schmoobar()
)

So, you ATTEMPT to parse some expansion and then, after RECOVER, you can have two things: a block of Java code and another expansion to fall back on. As things stand, you can have both the Java code block and the recovery expansion, or just one of the two. (Though I guess you could effectively have neither by simply putting in an empty Java code block {} and not having any recovery expansion.)

In any case, the idea is that your java code tweaks something or other so that you can recover. Maybe it skips past some invalid goo or it changes to another lexical state before resuming the parse.
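
To give a rough idea of what "skipping past some invalid goo" could amount to, here is a minimal Java sketch. Everything in it (SkipSketch, Tok, Kind, skipTo, skipPast) is a hypothetical name invented for illustration. It works on a plain list of tokens and is not the API that the generated parser actually exposes.

```java
import java.util.List;

// Hypothetical token model for illustration only; these are NOT the
// classes that JavaCC generates.
final class SkipSketch {
    enum Kind { IDENT, NUMBER, SEMICOLON, GARBAGE }

    static final class Tok {
        final Kind kind;
        final String image;
        Tok(Kind kind, String image) { this.kind = kind; this.image = image; }
    }

    // Index of the first token at or after 'from' whose kind is 'wanted',
    // or tokens.size() if there is no such token.
    static int skipTo(List<Tok> tokens, int from, Kind wanted) {
        int i = from;
        while (i < tokens.size() && tokens.get(i).kind != wanted) {
            i++;
        }
        return i;
    }

    // Index just after the first 'wanted' token at or after 'from',
    // or tokens.size() if there is no such token.
    static int skipPast(List<Tok> tokens, int from, Kind wanted) {
        int i = skipTo(tokens, from, wanted);
        return i < tokens.size() ? i + 1 : i;
    }
}
```

A recovery block could use something like skipPast to resume right after the next semicolon, which is the classic "resynchronize at the end of the statement" trick.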

Regardless, the key thing to take away from this is that when you hit RECOVER, the state of your world is restored to what it was right before the ATTEMPTed expansion. That includes the state of the tree building machinery and your lexical state and such.
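
The rewinding can be pictured with a small Java sketch. This is only a model under simplifying assumptions, not the code the tool generates: the class AttemptSketch, its fields (position, lexicalState, nodeStack) and the methods expect and attempt are all made up for this example, and tokens are plain strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of ATTEMPT/RECOVER; all names here are hypothetical.
final class AttemptSketch {
    static final class ParseFailure extends RuntimeException {
        ParseFailure(String message) { super(message); }
    }

    private final List<String> tokens;
    int position = 0;                                    // index of the next token
    String lexicalState = "DEFAULT";                     // stand-in for the lexer state
    final Deque<String> nodeStack = new ArrayDeque<>();  // stand-in for tree building

    AttemptSketch(List<String> tokens) { this.tokens = tokens; }

    // Consume one token with the given image, or fail.
    void expect(String image) {
        if (position >= tokens.size() || !tokens.get(position).equals(image)) {
            throw new ParseFailure("expected " + image + " at token " + position);
        }
        nodeStack.push(image);
        position++;
    }

    // Run 'attempted'; if it fails, restore the saved state and run 'recovery'.
    void attempt(Runnable attempted, Runnable recovery) {
        int savedPosition = position;
        String savedLexicalState = lexicalState;
        List<String> savedNodes = new ArrayList<>(nodeStack);
        try {
            attempted.run();
        } catch (ParseFailure e) {
            position = savedPosition;
            lexicalState = savedLexicalState;
            nodeStack.clear();
            nodeStack.addAll(savedNodes);
            recovery.run();
        }
    }
}
```

The point of the sketch is only that the recovery code starts from a known place: the token position, the lexical state and the partially built tree are exactly what they were before the attempt.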

ATTEMPT/RECOVER semantics?

Now, this is an experimental new feature and I am quite interested in getting feedback about how it should work. For example, I am grappling with the question of how syntactic lookaheads should deal with an ATTEMPT/RECOVER block. To illustrate the current state of things, suppose you write:

void Foobar() : 
{}
{ 
   ATTEMPT(Foo()Bar())RECOVER(Baz())
}

In the above case, a syntactic LOOKAHEAD, like LOOKAHEAD(Foobar()) will create a lookahead routine that scans forward for the ATTEMPTed expansion Foo() Bar() but does not check for Baz().

The idea is that the Foobar() production really only completes normally via Foo() Bar(), not the recovery expansion Baz(), which we are using to fall back on if we can't do Foo() Bar().

It could be argued that LOOKAHEAD(Foobar()) should check forward for Foo()Bar() OR Baz() since both of them would end up matching the production. I'm really not sure and would be quite happy to discuss with people how this should work. This would be anybody's chance to have some input at an early stage to how the next generation of this tool will work.
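
The two candidate behaviors can be put side by side in a small Java sketch. It rests on a loud simplifying assumption: Foo(), Bar() and Baz() each match exactly one token, spelled "foo", "bar" and "baz". The class and method names are invented for the comparison and do not correspond to the generated lookahead routines.

```java
import java.util.List;

// Hypothetical model of the two possible LOOKAHEAD(Foobar()) behaviors.
final class LookaheadSketch {
    // True if the tokens starting at 'from' begin with the expected sequence.
    static boolean startsWith(List<String> tokens, int from, String... expected) {
        if (from + expected.length > tokens.size()) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (!tokens.get(from + i).equals(expected[i])) {
                return false;
            }
        }
        return true;
    }

    // Behavior described above: scan only for the ATTEMPTed expansion Foo() Bar().
    static boolean scanAttemptOnly(List<String> tokens, int from) {
        return startsWith(tokens, from, "foo", "bar");
    }

    // Alternative under discussion: accept Foo() Bar() OR the recovery expansion Baz().
    static boolean scanAttemptOrRecovery(List<String> tokens, int from) {
        return startsWith(tokens, from, "foo", "bar")
            || startsWith(tokens, from, "baz");
    }
}
```

The two routines disagree only on input that can be parsed solely through the recovery expansion, which is exactly the case the question is about.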

So I'll just close by saying that this new feature is currently experimental and subject to change and we are very interested in feedback. I would also add that the feature, though already far more useful than the existing try-catch (which is still available, by the way), will become more useful over the coming weeks and months, as more error-recovery machinery is introduced, so that there is a clearer answer to what one can do in the recovery block!

Notable Replies

  1. Hi
    I believe we need to be more precise on the “state” left when encountering an exception: in the TCF (try / catch / finally), what exactly is the current behavior if we have try (p1() p2() … pn()) and an exception arises in p1, p2 or pn?
    And likewise, what “state” do you intend to recover in attempt (p1() p2() … pn()) recover () for all cases?
    And does this depend on the exception type (parser / lexical)?
    On my side, I have a TCF in a real-world grammar that repositions the token manager to the end of the line and tries to resume parsing.
    I can imagine using attempt / recover syntax for cases like this:
    attempt ( “(” p() “)” ) recover ( attempt ( “(” p() – missing RPar – ) recover ( attempt ( p() “)” – missing LPar – ) recover ( giveup) ) ) for IDEs to handle simple cases instead of using ("(")* with the extra lookaheads needed.

  2. The idea of ATTEMPT/RECOVER is that if the ATTEMPT part fails, the parser/lexer machinery is rewound to the state it was in before the ATTEMPT. I believe this is working now, but it is hardly tested at all.

    The basic idea is that, in the general case, the RECOVER part has two components, a Java code block and then a grammatical expansion. So, presumably, the Java code block is where you have the opportunity to make some adjustments so that the parse can succeed. One could even imagine code like this:

    ATTEMPT(Foo())
    RECOVER {…some Java code…} (Foo())

    Of course, if the Java code block does nothing, then the whole thing is for nothing: it attempts Foo() and fails, the parsing machinery rewinds, it attempts Foo() again and, well, it will fail again! Guaranteed!

    So some adjustment has to happen in the java code block so that when we try to parse Foo() the second time, it will succeed. (Or at least have some chance of succeeding as opposed to definitely failing.)

    What adjustment? Well, you could move forward one character in the input stream, or skip forward one token or scan forward for a token of a certain type.

    Or you could change lexical state maybe…

    So, you are right that it is not really formalized what you can do in the java code part of the RECOVER and this is what needs to get clarified. You have to understand that this is still very much a work in progress!

    But you see that this is already laying some basis for attacking the problem. With the existing try-catch machinery, when you enter the catch block, it is very unclear what you can really do. It’s not even clear where you are. With ATTEMPT/RECOVER, the parsing/lexing machinery rewinds to the starting point, so you have some clarity about where you are at least and what can be done.

    But you know, the thing about this, the try-catch and the JAVACODE productions as well, is that it is very hard to find many examples of usage out there. And this is precisely because these things are not really very useful!

  3. I was suggesting first studying what legacy javacc exactly does before making a decision for a new feature: maybe with just a few lines of code you can make it reposition exactly at the “beginning” of the faulty production (token pointers, built nodes, …).

  4. Fault-tolerant parsing definitely sounds like it would be a big plus for JavaCC21, and, depending on the faults that it successfully handles, could by itself make it worthwhile for std JavaCC projects to convert to '21. I eagerly await its release to see how its fault-tolerance matches up against the fault injection rate of my newbie attempts (artificial intelligence vs my natural stupidity).
    I liked your idea of logging ATTEMPT/RECOVER (if they even exist in the fault-tolerant version), like you say, trivial to add a line to the right template, although (speaking as someone who always forgets to turn on things until they bite me in the ass) I do like the way javac sends warnings to the console and keeps on compiling. But that’s me.

  5. I think the scenario that you described (JavaCC implementer moves on and everyone keeps using the “black box” because it works) is probably pretty accurate. I know when I started working for a software development company that I was shocked by the level of disinterest shown by the programmers - based on their interest in their craft they could just as well have been doing taffy-pulling or stable-cleaning instead of programming - it was just a way to collect a paycheck. Although we were programming in COBOL so their disinterest is understandable.
    I think there’s also the “professional Guru” to be accounted for, who provides a lot of resistance to change. Professional Gurus pride themselves on knowing obscure corners of a language or difficult tools and use that knowledge to gain prestige or power positions over others. And the original JavaCC certainly qualifies as an obscure corner of programming and the fact that it uses the command line instead of a GUI or an IDE makes it even more likely that disinterested programmers will infringe on their positions of power.
    And then there’s the “learning curve” aspect - much like Hell Week in football and frat-house hazing, difficult processes reward those who get thru them with the feeling of accomplishment. Same with software; the harder the software is to learn, the greater the sense the learners have of having achieved something great. And once having mastered their tool, they are unwilling to sacrifice their achievement for something that makes their lives easier.
    So focusing on new users makes a lot of sense, ones who might still have some curiosity about their craft or even ones who are just trying to avoid having to learn all that JavaCC crazy stuff, with RegEx expressions and all those empty curly braces.
    So here’s an idea that might draw in new users; a (GUI?) tool that makes it easy and quick to build a tiny single-purpose parser, something that programmers can spin up a parser class in a couple of minutes that will consume user-generated input or csv files and output some well structured output. Almost every one of us has had to depend on user-supplied info at some point in a(lmost every) program and it always starts as a couple of lines of code and then becomes a dozen lines that grows to 100 lines of ugly poorly structured code by the time all the user errors have been accounted for. Until the next time a user comes up with a unique way of generating an error.
    It’d sure be nice if there was a tool where I could just specify that fields are alpha characters and some punctuation or that <CELL_NUMBER> was numbers only, etc, push a button and get a Java class that checks those rules.
    And every ETL project that I’ve participated in eventually reached a point where the load blows up and we end up scanning the text dump to search for troublemakers until our eyes crossed. And after all that, we didn’t know if we had fixed all the problems or if we had just found all the crashable errors and then loaded lots of crud that just didn’t crash the database. And it was always a manual effort because there wasn’t time to build a tool that would search for errors.
    But I do go on sometimes. . . .
    I am glad that you’re not giving up and that you get enjoyment and satisfaction from what you’re doing. I think you’re making some great improvements to JavaCC and appreciate all your efforts in streamlining grammars - I especially hate situations when success or failure depends on correctly punctuating the code. I can accept using the wrong word or operator or method but finding that my anonymous inner method failed because I entered ));} instead of )}:wink: makes me crazy.
    And I happened on JavaCC21 because I was trying to learn JavaCC (autodidact syndrome) and wasn’t having much luck so was googling JavaCC and I came across your project and liked what you were saying about streamlining and improving how grammars are written and processed. Even though I only understood some of it, it’s slowly making more sense as I experiment with it. In other words, I wasn’t a committed user of JavaCC so I was open to improvements.
    Keep up the good work
