“You can’t get there from here!” — The Problem of Context-Sensitive Tokenization

Since I picked up my work on the JavaCC codebase at the end of 2019, various people have raised with me the question of strings that can (or should) be broken into tokens differently depending on the context in which they are encountered.

I have to admit that it took me a while to grasp just how fundamental this problem was. Also, I came to see that, due to basic design flaws and limitations that have never been addressed, this whole problem has basically been intractable in JavaCC from the very beginning. Below, I'll explain why, and also how JavaCC21 now does provide the basic infrastructure to address the issue. I should warn the reader that this post will be fairly involved. However, I think that anybody who anticipates using JavaCC in any non-trivial way really should bear with me. A careful reading of this post will prove worthwhile. (If not, the article does come with a money-back guarantee...)

Now, to frame this whole question of context-sensitive tokens, I think it is far better to motivate the conversation with a concrete example than to start with some abstract general theory. I think the best case study would be Java generics, since pretty much all JavaCC users are seasoned Java programmers. (This should change in a medium-term time frame, but is certainly the case for the time being.)

So here is the case we shall analyze:

The Angle Brackets in Java Generics

Basic problem: the strings ">>" and ">>>" need to be treated entirely differently depending on the context in which they appear.

Example: consider the following generic type:

Map<String, List<String>>

In the above, the angle brackets "<" and ">" are being used to open and close a list of type parameters, and since type parameters can nest, the two closing characters ">>" are actually two separate ">" tokens.
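
In fact, with one more level of nesting, even ">>>" can be three separate closing brackets. For instance, in a declaration like the following (an illustrative example of my own, not one taken from the grammar):

List<Map<String, List<String>>> nested;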

By contrast, in the following statement:

int j = i >> 3;

the ">>" is actually a single token, the right signed shift operator. (The right unsigned shift operator, ">>>" is also problematic, of course.) Clearly, any Java parser must distinguish between the above two uses of ">>". This problem was one of the first things I had to resolve when I picked up my FreeCC work at the end of 2019. After over a decade of neglect, FreeCC did not support (fully) any remotely recent version of the Java language and my first priority was to remedy that state of affairs. So this problem of the ambiguity between the closing angle brackets and the right shift operators was one of the first things I had to solve. I did so, but my solution was not very satisfying to me, and I made a mental note to revisit the problem. What I implemented was (to my taste) a horrific kludge. It did have the merit that... well, it does work...

Context-dependent parsing of ">>" and ">>>", solution I

The solution I ended up implementing was in three parts:

  1. define the RSIGNEDSHIFT and RUNSIGNEDSHIFT tokens in a separate phony lexical state
  2. create a grammatical production to specially handle two or three consecutive ">" tokens
  3. use that grammatical production at the appropriate place in the grammar where we would normally have used just a reference to the actual tokens.

The first step above looks like this:

<PHONY> TOKEN :
{
<RSIGNEDSHIFT : ">>" >
|
<RUNSIGNEDSHIFT : ">>>" >
}

This lexical state is called PHONY to provide a strong hint to anybody reading the grammar that the tokenizing machinery never actually enters this lexical state. And since it never does, ">>" and ">>>" are always tokenized as a sequence of two or three ">" tokens.

That is the first step, but it is the second step that provides the little magic trick to square the circle. This involves defining a grammatical production called ShiftOperator as follows:

void ShiftOperator() #void :
{
   Token first, second, third=null;
}
{
   first=">"
   second=">" 
   [third=">"]
   {
      Token result = null;
      if (second.getBeginColumn() != first.getBeginColumn() + 1) {
        throw new ParseException("Shift operator cannot have spaces or comments inside!\n" + getInputSource() + ":line " + first.beginLine + ":column " + first.beginColumn);
      }
      if (third!=null) {
        if (third.getBeginColumn() != first.getBeginColumn() + 2) {
            throw new ParseException("Shift operator >>> cannot have spaces or comments inside!\n" + getInputSource() + ":line " + first.beginLine +":column " +first.beginColumn);
        }
        first.image = ">>>";
        first.kind = RUNSIGNEDSHIFT;
        first.next = third.next;
        popNode();
        popNode();
      } else {
        first.image = ">>";
        first.kind = RSIGNEDSHIFT;
        first.next = second.next;
        popNode();
      }
    }
  }

What the above grammatical production does is take the two or three consecutive ">" tokens that come from the lexical machinery and merge them into a single RSIGNEDSHIFT or RUNSIGNEDSHIFT token.

The third step above is to use this ShiftOperator production where we would normally use the tokens. Like this:

void ShiftExpression():
{}
{
    AdditiveExpression()
    (
        LOOKAHEAD(2)
        (
            "<<"
            |
            ShiftOperator()
        )
        AdditiveExpression()
    )*
}

Again, the above 3-step solution does have the merit that it works. However, it was deeply unsatisfying to me and I made a mental note to revisit this issue and find a cleaner solution.

I eventually did put in place a better solution, just in the last few days, but it turns out that this was only possible because of some quite significant refactoring/enhancement of the codebase that had occurred since the earlier hack. I am now pretty sure that, in legacy JavaCC, it is simply impossible to implement a significantly better solution than the 3-step kludge I describe above. The legacy JavaCC project actually does have a different solution to the above problem, but to my taste, it is actually somewhat worse than the one I outline above. That might be a question of taste, but regardless, it is also a messy 3-part hack.

The Current Solution

The current solution involves using a TOKEN_HOOK method to handle this case. It looks like this:

INJECT PARSER_CLASS : {
  private Token TOKEN_HOOK(Token tok) {
    String img = tok.getImage();
    if (img==null || !img.startsWith(">")) return tok;
    boolean inGenericTypeSpec = isInProduction("TypeArguments") || isInProduction("TypeParameters");
    if (inGenericTypeSpec && (img.equals(">>") || img.equals(">>>"))) {
        // If we've entered the TypeParameters or TypeArguments production, we need to split
        // a ">>" or ">>>" into 2 (or 3) GT tokens.
        Token gt = Token.split(tok, 1, GT, GT);
        if (img.length() == 3) {
          Token next = Token.split(gt.getNext(), 1, GT, GT);
          gt.setNext(next);
        }
        return gt;
    }
    else if (!inGenericTypeSpec && img.length() ==1) {
      // In this case we do the reverse. We merge 2 (or 3) GT tokens into a right shift operator
      Token next = tok.getNext();
      if (next != null && next.getType() == GT && next.getBeginColumn() == tok.getBeginColumn()+1) {
        Token nextNext = next.getNext();
        Token merged = Token.merge(tok, next, RSIGNEDSHIFT);
        if (nextNext != null && nextNext.getType() == GT && nextNext.getBeginColumn() == next.getBeginColumn() +1) {
          merged = Token.merge(merged, nextNext, RUNSIGNEDSHIFT);
        } 
        return merged;
      }
    }
    return tok;
  }
}

What the above token hook routine does is check whether we are in a TypeParameters or TypeArguments construct. If we are (i.e. inGenericTypeSpec is true) AND the next token in the stream is a ">>" or a ">>>", then it splits that token into two (or three) separate ">" tokens.

It also must deal with the opposite case: if we are NOT in a TypeParameters or TypeArguments construct (i.e. inGenericTypeSpec is false) AND we have two (or three) adjacent ">" tokens, we merge them into a single ">>" (or ">>>") token.

(N.B. The Token.split and Token.merge methods used in the above method were added to the Token.java boilerplate, since I anticipate them being generally useful to others.)
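
If you are wondering what those two helpers amount to, the following is a minimal, self-contained sketch of split-style and merge-style operations in the same spirit. To be clear, this is my own illustration and not the actual JavaCC21 boilerplate: the class name SketchToken and its fields are simplified stand-ins, and the real Token class carries much more information (line numbers, special tokens, and so on).

import java.util.Objects;

// A minimal sketch of what split/merge operations on a token chain boil down to.
// NOT the actual JavaCC21 Token.java; names and fields are simplified assumptions.
final class SketchToken {
    int kind;            // token type constant, e.g. GT or RSIGNEDSHIFT
    String image;        // the matched text
    int beginColumn;     // 1-based column of the first character
    SketchToken next;    // cached link to the following token

    SketchToken(int kind, String image, int beginColumn) {
        this.kind = kind;
        this.image = Objects.requireNonNull(image);
        this.beginColumn = beginColumn;
    }

    // Split one token into two at position `length`, e.g. ">>" -> ">" + ">".
    // Returns the first piece; its `next` points at the second piece, and the
    // second piece inherits the original token's `next` link.
    static SketchToken split(SketchToken tok, int length, int kind1, int kind2) {
        SketchToken first = new SketchToken(kind1, tok.image.substring(0, length), tok.beginColumn);
        SketchToken second = new SketchToken(kind2, tok.image.substring(length), tok.beginColumn + length);
        second.next = tok.next;
        first.next = second;
        return first;
    }

    // Merge two adjacent tokens into one of the given kind, e.g. ">" + ">" -> ">>".
    // The merged token takes over the second token's `next` link.
    static SketchToken merge(SketchToken first, SketchToken second, int mergedKind) {
        SketchToken merged = new SketchToken(mergedKind, first.image + second.image, first.beginColumn);
        merged.next = second.next;
        return merged;
    }
}

The important property, and the one the hook above relies on, is that the cached token chain stays consistent: after a split, the first piece's next points at the second, and after a merge, the merged token takes over the second piece's next link.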

With the above TOKEN_HOOK method in place, the rest of the grammar that refers to these things can be written naturally. Thus, there is now no need for a separate PHONY lexical state. The RSIGNEDSHIFT and RUNSIGNEDSHIFT tokens are simply defined along with everything else in the default lexical state.

There also is no need for a separate ShiftOperator grammatical production. Thus, the ShiftExpression production is now just written in a completely natural way, as follows:

ShiftExpression :
  AdditiveExpression
  (
      ("<<" | ">>" | ">>>")
      AdditiveExpression
  )*
;

(N.B. The above production is written in the new streamlined syntax)

The Devil is in the Details

The current solution is still a bit crude and will probably be refined over the coming period by introducing certain much-needed constructs in JavaCC itself. At the moment, we are still dropping into Java code; ideally, the whole thing could be expressed in JavaCC itself.

However, the current solution is solid and correct in a way that the previous hack was not. Moreover, this improved solution is simply not feasible in legacy JavaCC (and was not feasible in JavaCC21 until very recently) for a couple of key reasons.

  1. Due to the way it is implemented, the legacy CommonTokenAction hook cannot be used to handle the above case.

  2. Legacy JavaCC offers no way to query whether we have entered a given grammatical production. That is what the isInProduction method in the above TOKEN_HOOK method does.

The key to understanding the first point above is in the first line of the code injection:

 INJECT PARSER_CLASS {...}

This TOKEN_HOOK method is being injected into the XXXParser class, NOT the XXXLexer class!

You see, the generated parser has a getNextToken method that looks something like:

private Token getNextToken(Token tok) {
  Token result = tok.getNext();
  if (result == null) {
      result = token_source.getNextToken();
      tok.setNext(result);
  }
  return result;
}

The crux of the matter is this: if the current token already has its next field set, the token_source.getNextToken() method is simply never called! So you obviously can't solve this problem with a CommonTokenAction hook because it simply won't be invoked! At least, not reliably. You see, if a lookahead routine has already scanned ahead and tokenized ">>" as a single RSIGNEDSHIFT token, then when the parser machinery pulls the next token off the stream, as in the above method, it just returns that cached token, and your token hook method never gets the chance to do its thang: namely, to break it into two ">" tokens.

This is the key reason (or actually one of two key reasons, it turns out) why legacy JavaCC simply cannot solve this problem in a clean way. You would always need something like the rather grotesque 3-part hack that I originally came up with and outlined above.

This is a key point, but, properly understood, it could be considered part of a larger problem: there is simply no concept of rewinding the parsing/lexing machinery in legacy JavaCC.

You see, once you have context-dependent tokens and an arbitrary amount of lookahead, if the lookahead fails, you would very frequently need to rewind the token stream or somehow revisit the cached tokens, as we do here. In this specific case of Java generics, if you had a lookahead that scanned ahead to see if what follows is a certain kind of expression, it could identify ">>" as a single token. If the lookahead fails and now we are going to parse the same input as if it was a type declaration, it needs to have the smarts to break the ">>" into two separate ">" tokens.

But again, there is simply no point at which to do that! The token hook routine needs to be inserted in the appropriate place in the parser (NOT the lexer) code.

This is fixed in JavaCC21. In that routine, our TOKEN_HOOK method (injected into the PARSER_CLASS!) gets invoked in the right spot. So it now looks like:

final private Token nextToken(Token tok) {
    Token result = tok.getNext();
    if (result == null) {
        result = token_source.getNextToken();
        tok.setNext(result);
    }
    result = tokenHook$Java_javacc_line_42(result);  // <== the injected TOKEN_HOOK call
    tok.setNext(result);
    return result;
}

In the general case, this is where the token hook method has to be inserted! At least if you want to use it to resolve this kind of problem.

(N.B. JavaCC21's TOKEN_HOOK feature is also significantly improved because you can define multiple TOKEN_HOOK methods and they will be inserted in the right place. The placeholder "TOKEN_HOOK" gets replaced with a unique method name. So you can define more than one! This is explained here.)

Note, however, that being able to inject the token hook method in the right place is still not quite sufficient because... well... legacy JavaCC is simply too underpowered: it gives you no robust, standard way of knowing whether you are inside a given production. That is what the line:

 boolean inGenericTypeSpec = isInProduction("TypeArguments")
                             || isInProduction("TypeParameters");

is about. The isInProduction method is leveraging the same machinery that is used by JavaCC21's lookbehind predicates. It works from within a regular parsing routine or a lookahead routine.
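
If you are curious how such a check can work at all, one simple way to get the same effect is for the generated parser to maintain a record of the productions it is currently inside. The following is purely illustrative: the class and method names here are my own, and JavaCC21's actual mechanism (the one shared with its lookbehind predicates) is more involved than this.

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: a parser keeps a call-stack-like trace of the productions it has
// entered, so any hook can ask "are we somewhere inside TypeArguments right now?"
final class ProductionTrace {
    private final Deque<String> openProductions = new ArrayDeque<>();

    // Generated code would call these on entry to and exit from each production,
    // in both regular parsing routines and lookahead routines.
    void enter(String productionName) { openProductions.push(productionName); }
    void exit() { openProductions.pop(); }

    boolean isInProduction(String productionName) {
        return openProductions.contains(productionName);
    }
}

The point is simply that the question "are we inside production X?" can be answered cheaply at any moment, including from inside a token hook.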

There are various other points to be made about this, but this post may be reaching (or have already reached!) a point of diminishing returns. So, I'll try to wind it down here.

Concluding Remarks and Recap

It occurs to me that the TOKEN_HOOK method that I presented above may still look pretty hairy to most readers. However, once you understand its logic, it should not be so intimidating.

The basic problem is that the input ">>" or ">>>" means something entirely different depending on the context in which it occurs. Once we have a syntactic lookahead routine that tries to scan this as the first case, we have a big problem if that routine returns false, because, the way the code was structured (for the last 24 years or so), there was no way to inject a token hook (what was called CommonTokenAction in legacy JavaCC) to deal with the situation. Or, in other words, the specific cases that CommonTokenAction ought to be able to handle could not possibly be handled that way!

Like... dude.... you can't get there from here!

The TOKEN_HOOK method above handles two basic cases.

  1. The parsing machinery has scanned ahead and identified the ">>" as a single token, but our token hook method needs to break it into two ">" tokens.

  2. It scanned ahead and identified the ">>" as two separate ">" tokens but it now realizes we are in an arithmetic expression where ">>" (or ">>>") means a right shift operator, so it has to merge the adjacent ">" tokens into a single token.

Of course, depending on the exact problem you are trying to solve, the solution could be something quite different. Actually, the more astute readers will already have realized that the above discussion is not really specifically about solving this problem of the angle brackets in Java generics. That is simply one example of a much more general problem. Even if, at first blush, your problem is quite different from this one, if you are running into a wall because you need input to be scanned differently based on context, your first step should be to carefully go over the above example and make sure you understand it. Admittedly, there are some details that I have glossed over here. You can expect a later post (or two) soon to tie up those loose ends. And, of course, you are more than welcome to sign up on the Discourse forum and pose any questions or offer suggestions.

Notable Replies

  1. I mentioned that legacy JavaCC had a different kludge to deal with this problem. I didn’t state what it was because I felt that the blog article would get too long, but I figure I might as well outline it here, since it is presumably about the best solution that the JavaCC maintainers could come up with in over a decade.

    I earlier said that their solution was also a 3-part kludge, but now that I look at it again, I think it is best described as a 5-part kludge, as follows:

    1. They define a specialized Token subclass called GTToken which has an extra field called realKind. You can see that here. (A sketch of this and the next step appears right after this list.)
    2. Then they have a custom newToken factory method in Token.java that instantiates the GTToken subclass for the token types GT, RSIGNEDSHIFT, and RUNSIGNEDSHIFT, i.e. “>”, “>>”, and “>>>”. You can see that here.
    3. They define lexical actions for the RSIGNEDSHIFT and RUNSIGNEDSHIFT token types that set this realKind field. The regular kind field is set to GT, but this extra realKind field records whether what we really have is an RSIGNEDSHIFT or RUNSIGNEDSHIFT.
    4. They create two grammatical productions for RSIGNEDSHIFT and RUNSIGNEDSHIFT. You can see that here.
    5. They use the above grammatical productions in the grammar where one would normally just use the tokens.
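
    To make the first two steps concrete, here is roughly the shape of that hand-edited code. This is a reconstruction from the description above, not a verbatim copy of the legacy file, so the details may well differ.

    // A reconstruction of steps 1 and 2 (the hand-edited Token.java); details may differ
    // from the actual legacy code.
    public static class GTToken extends Token {
        // The "official" kind stays GT; this extra field records what was really matched.
        int realKind = GT;
    }

    // The customized factory method: ">", ">>" and ">>>" all get instantiated as GTToken
    public static Token newToken(int ofKind) {
        switch (ofKind) {
            case GT:
            case RSIGNEDSHIFT:
            case RUNSIGNEDSHIFT:
                return new GTToken();
            default:
                return new Token();
        }
    }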

    The first two steps above involve post-editing the generated Token.java file. This can be avoided in JavaCC21. In the Token production where you define these GT, RSIGNEDSHIFT, and RUNSIGNEDSHIFT tokens, you could just start it with:

    TOKEN #GTToken : {
        ... etc ...
    } 
    

    This would mean that the tokens defined there are instantiated as instances of GTToken, and a GTToken.java is automatically generated. Then you would have a little code injection to add that realKind field, which looks like:

    INJECT GTToken {int realKind;}
    

    This way you avoid the whole boondoggle of post-editing generated files. That is what makes this solution particularly horrible in legacy JavaCC: if the standard boilerplate in the default Token.java file ever changes, your post-edited Token.java file must be kept in sync with those changes.

    The third step involves adding lexical actions for when the RSIGNEDSHIFT and RUNSIGNEDSHIFT tokens are matched. That looks like:

    < RUNSIGNEDSHIFT: ">>" >
    {
        matchedToken.kind = GT;
        ((Token.GTToken)matchedToken).realKind = UNSIGNEDSHIFT;
        input_stream.backup(2);
        matchedToken.image = ">";
    }
    

    In any case, the gist of the solution is to instantiate any token that might represent a right shift operator as a special Token subclass that has both a kind and a realKind field. So, as you would guess, the parsing machinery is going to treat this token according to its kind or its realKind, depending on the case.

    So, now they create two different grammatical productions to represent the two (signed and unsigned) shift operators. Like so:

    void RUNSIGNEDSHIFT():
    {}
    {
        ( LOOKAHEAD({ getToken(1).kind == GT &&
                        ((Token.GTToken)getToken(1)).realKind == RUNSIGNEDSHIFT} )
           ">" ">" ">"
        )
    }
    

    The above production uses semantic lookahead to check whether the next token off the stream is a GT (its “official” type) AND its real kind is RUNSIGNEDSHIFT (or RSIGNEDSHIFT depending on the case).

    The fifth and final step is to use the productions thus defined at the appropriate spots in the grammar. You can see that here.

    So that is the 5-step hack. I hope you did not just eat your lunch, dear reader.

    Looking back on this, when I needed to solve this problem at the end of last year, I must have looked at this and… like… no ****ing way!

    It was just too horrible – and that is even taking into account that it is less horrible in JavaCC21, because there is no need to post-edit the generated Token.java file.

    So that is why I came up with the 3-step solution (still a hack) that I outline in the blog post.

    A funny thing about this is that the 5-step hack that the legacy JavaCC project uses internally is not just used by them. You can find it here unchanged and here as well.

    It’s also quite fascinating that both of those projects have left in place the comment that appears there:

    /* We use productions to match >>>, >> and > so  that we can keep the
     * type declaration syntax with generics clean
     */
    

    This 5-step hack keeps their type declaration syntax “clean”, eh? Cripes. One has to wonder what “dirty” would look like?

    By the way, in case readers are wondering about this, I actually approached both of those projects (PMD and JavaParser respectively) that were using that hack unchanged, to try to interest them in my work on resuscitating JavaCC and fixing these problems, but in both cases I was met with indifference veering towards outright hostility. I have more to say about that phenomenon (not that I understand it perfectly either, mind you) but not here, since this is a technical post.

  2. And what about ANTLR?

    ANTLR’s grammars-v4 repository seems to have three different (though maybe not very different) Java grammars. None of them seems to address this problem at all – I mean, the context-sensitive lexing of “>>” and “>>>”. They simply treat the signed and unsigned shift operators as two (or three) greater-than tokens in a row. So you can see here that their definition of the shiftExpression production is:

    shiftExpression
    :	additiveExpression
    |	shiftExpression '<' '<' additiveExpression
    |	shiftExpression '>' '>' additiveExpression
    |	shiftExpression '>' '>' '>' additiveExpression
    ;
    

    Strangely enough, they even use two consecutive LT tokens in the production to represent the left shift operator, even though a separate Token for “<<” is not problematic! (This is because two consecutive left angle brackets cannot actually occur in a TypeArguments or TypeParameters production, so “<<” might as well be defined as a Token in the lexical part!)

    Well, regardless, to make a long story short, they simply decline to address the problem. One aspect of this is that the following:

     int i = j >   >   3;
    

    would be accepted by their Java grammar. As would:

    int n = m >  >  /* Hey, mon! */ > p;
    

    I haven’t yet checked this, but it seems, based on the grammar, that it would parse either statement without complaint! (The ghastly 5-part kludge used in legacy JavaCC would at least complain about this. That kludge only works if the “>” characters are directly adjacent to one another.)

    I also am not sure whether, by default, ANTLR4 generates an AST with the Tokens as leaf nodes. (This is what JavaCC21 does by default, hence my concern with getting this right! It would be wrong to build a tree in which the right shift operators are sequences of two or three GT tokens!)

    To put it visually (not that this is my forte), the problem is that the correct AST for the expression i >>> j is:

                shiftExpression
         /            |          \
     IDENTIFIER RUNSIGNEDSHIFT  IDENTIFIER
    

    It is NOT:

            shiftExpression
         /       / |  \   \
     IDENTIFIER GT GT GT  IDENTIFIER
    

    Now, granted, one could run over the AST in a fix-up walk after the parse and correct this, and also check whether the “>>>” is all contiguous with no whitespace or comments inside, but I see no sign that they do any of that. Well, maybe I’m being too harsh here…

    Dude, c’mon… this is good enough for government work!
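
    For what it’s worth, the kind of post-parse fix-up walk I alluded to a couple of paragraphs back could look roughly like the following. This is purely my own sketch against a made-up miniature AST (the Leaf type, fixUp, and adjacentGT are all invented for illustration); it is not ANTLR’s API, nor anything from JavaCC21.

    import java.util.List;

    // Illustrative only: assumes a hypothetical AST where a node's children sit in a
    // mutable List and leaf nodes carry a token type plus line/column info.
    final class ShiftFixUp {
        static final int GT = 1, RSIGNEDSHIFT = 2, RUNSIGNEDSHIFT = 3;

        static final class Leaf {
            int type, line, column;
            Leaf(int type, int line, int column) { this.type = type; this.line = line; this.column = column; }
        }

        // Collapse a run of two or three directly adjacent GT leaves into one shift-operator leaf.
        static void fixUp(List<Leaf> children) {
            for (int i = 0; i + 1 < children.size(); i++) {
                if (!adjacentGT(children.get(i), children.get(i + 1))) continue;
                int run = (i + 2 < children.size() && adjacentGT(children.get(i + 1), children.get(i + 2))) ? 3 : 2;
                Leaf first = children.get(i);
                Leaf merged = new Leaf(run == 3 ? RUNSIGNEDSHIFT : RSIGNEDSHIFT, first.line, first.column);
                children.subList(i, i + run).clear();
                children.add(i, merged);
            }
        }

        // Adjacency check: same line, columns touching, so "> >" or ">/*x*/>" is rejected.
        static boolean adjacentGT(Leaf a, Leaf b) {
            return a.type == GT && b.type == GT && a.line == b.line && b.column == a.column + 1;
        }
    }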

    What I am still wondering about is whether ANTLR has any more disposition for dealing with this problem of context-dependent tokenization than legacy JavaCC does. If it does, they made no use of it when writing these Java grammars.

    Another aspect of the ANTLR4 Java grammar (I have to admit that I only tried one of them) is that it is extremely non-performant. On my 2014-vintage iMac, I can run over all the Java 8 source code (in src.zip), parsing it and building the AST, in about 15 seconds. The same benchmark using the ANTLR-generated parser takes well over half an hour – i.e. it is well over 100x slower. Frankly, I found that result so shocking that I thought I was doing something wrong. However, I tried to enlist the help of an ANTLR expert (he says he is one, anyway), one Federico Tomassetti, and the guy just did some handwaving about how performance isn’t everything. (I kinda suspect that ANTLR fanboys end up saying this a lot…)

    The general view out there does seem to be that ANTLR-generated parsers are shockingly slow, which is a problem that many users of legacy JavaCC, thinking they would migrate over to ANTLR, seem to have run into. But the other strange aspect of this is that I figured that, being so much slower, it would at least be absolutely correct. Yet I see that the way they handle this problem of the ambiguity between the angle brackets in generics and the right shift operators is quite sloppy. That said, I am interested in investigating ANTLR some more to see what it does bring to the table.

Continue the discussion at parsers.org