“You can’t get there from here!” — The Problem of Context-Sensitive Tokenization

(N.B. Note added 13 June 2021: This article is useful in terms of understanding how to add token hooks to code. However, in terms of solving the specific problem outlined, the article is obsolete. See here for the updated solution.)

Since I picked up my work on the JavaCC codebase at the end of 2019, various people have broached to me this question of strings that can (or should) be broken into tokens differently based on the context where they are encountered.

I have to admit that it took me a while to grasp just how fundamental this problem was. Also, I came to see that, due to basic design flaws and limitations that have never been addressed, this whole problem has basically been intractable in JavaCC from the very beginning. Below, I'll explain why, and also how JavaCC21 now does provide the basic infrastructure to address the issue. I should warn the reader that this post will be fairly involved. However, I think that anybody who anticipates using JavaCC in any non-trivial way really should bear with me. A careful reading of this post will prove worthwhile. (If not, the article does come with a money-back guarantee...)

Now, to frame this whole question of context-sensitive tokens, I think it is far better to motivate the conversation with a concrete example than to start with some abstract general theory. I think the best case study would be Java generics, since pretty much all JavaCC users are seasoned Java programmers. (This should change in a medium-term time frame, but is certainly the case for the time being.)

So here is the case we shall analyze:

The Angle Brackets in Java Generics

Basic problem: the strings ">>" and ">>>" need to be treated entirely differently depending on the context in which they appear.

Example: consider the following generic type:

 Map<String, List<String>>

In the above, the angle brackets, "<" and ">" are being used to open and close a list of type parameters and since they can nest, the two closing characters ">>" are actually two separate tokens.

By contrast, in the following statement:

int j = i >> 3;

the ">>" is actually a single token, the right signed shift operator. (The right unsigned shift operator, ">>>" is also problematic, of course.) Clearly, any Java parser must distinguish between the above two uses of ">>". This problem was one of the first things I had to resolve when I picked up my FreeCC work at the end of 2019. After over a decade of neglect, FreeCC did not support (fully) any remotely recent version of the Java language and my first priority was to remedy that state of affairs. So this problem of the ambiguity between the closing angle brackets and the right shift operators was one of the first things I had to solve. I did so, but my solution was not very satisfying to me, and I made a mental note to revisit the problem. What I implemented was (to my taste) a horrific kludge. It did have the merit that... well, it does work...

Context-dependent parsing of ">>" and ">>>", solution I

The solution I ended up implementing was in three parts:

  1. define the RSIGNEDSHIFT and RUNSIGNEDSHIFT tokens in a separate phony lexical state
  2. create a grammatical production to specially handle 2 or 3 consecutive ">" operators
  3. use that grammatical production at the appropriate place in the grammar where we would normally have used just a reference to the actual tokens.

The first step above looks like this:

<PHONY> TOKEN :
{
    <RSIGNEDSHIFT : ">>" >
   |
   <RUNSIGNEDSHIFT : ">>>" >
 }

This lexical state is called PHONY to provide a strong hint to anybody reading the grammar that the parsing machinery is never actually in this lexical state. The way it works is that we always tokenize ">>" and ">>>" as a sequence of two or three ">" tokens.

That is the first step, but it is the second step that provides the little magic trick to square the circle. This involves defining a grammatical production called ShiftOperator as follows:

void ShiftOperator() #void :
{
   Token first, second, third=null;
}
{
   first=">"
   second=">" 
   [third=">"]
   {
      Token result = null;
      if (second.getBeginColumn() != first.getBeginColumn() + 1) {
        throw new ParseException("Shift operator cannot have spaces or comments inside!\n" + getInputSource() + ":line " + first.beginLine + ":column " + first.beginColumn);
      }
      if (third!=null) {
        if (third.getBeginColumn() != first.getBeginColumn() + 2) {
            throw new ParseException("Shift operator >>> cannot have spaces or comments inside!\n" + getInputSource() + ":line " + first.beginLine +":column " +first.beginColumn);
        }
        first.image = ">>>";
        first.kind = RUNSIGNEDSHIFT;
        first.next = third.next;
        popNode();
        popNode();
      } else {
        first.image = ">>";
        first.kind = RSIGNEDSHIFT;
        first.next = second.next;
        popNode();
      }
    }
  }

What the above grammatical production does is that it handles two or three consecutive ">" tokens that come from the lexical machinery and merges them into a single RSIGNEDSHIFT or RUNSIGNEDSHIFT operator.

The third step above is to use this ShiftOperator production where we would normally use the tokens. Like this:

void ShiftExpression():
{}
{
    AdditiveExpression()
    (
        LOOKAHEAD(2)
        (
        "<<"
        |
        ShiftOperator()
    )
    AdditiveExpression()
    )*
}

Again, the above 3-step solution does have the merit that it works. However, it was deeply unsatisfying to me and I made a mental note to revisit this issue and find a cleaner solution.

I eventually did put in place a better solution, just in the last few days, but it turns out that this was only possible because of some quite significant refactoring/enhancement of the codebase that had occurred since the earlier hack. I am now pretty sure that, in legacy JavaCC, it is simply impossible to implement a significantly better solution than the 3-step kludge I describe above. The legacy JavaCC project actually does have a different solution to the above problem, but to my taste, it is actually somewhat worse than the one I outline above. That might be a question of taste, but regardless, it is also a messy 3-part hack.

The Current Solution

The current solution involves using a TOKEN_HOOK method to handle this case. It looks like this:

INJECT PARSER_CLASS : {
  private Token TOKEN_HOOK(Token tok) {
    TokenType type = tok.getType();
    if (type == RSIGNEDSHIFT || type == RUNSIGNEDSHIFT) {
        if (isInProduction("TypeArguments", "TypeParameters")) {
     // If we've entered the TypeParameters or TypeArguments production, we need to split
     // a ">>" or ">>>" into 2 (or 3) GT tokens.
        Token gt = Token.split(tok, 1, GT, GT);
        if (type == RUNSIGNEDSHIFT) {
          Token next = Token.split(gt.getNext(), 1, GT, GT);
          gt.setNext(next);
          gt.setNextToken(next);
        }
        return gt;
      }
   }
   else if (type == GT) {
       Token next = tok.getNextToken();
       if (next != null && next.getType() == GT &&   !isInProduction("TypeArguments", "TypeParameters")) {
  // In this case we do the reverse. We merge 2 (or 3) GT tokens into a right shift operator
           Token nextNext = next.getNextToken();
           Token merged = Token.merge(tok, next, RSIGNEDSHIFT);
           if (nextNext != null && nextNext.getType() == GT) {
              merged = Token.merge(merged, nextNext, RUNSIGNEDSHIFT);
      } 
      return merged;
    }
  }
  return tok;
  }
}

What the above token hook routine does is that it checks whether we are in a TypeParameters or TypeArguments construct. If YES, we are AND the next token in the stream is a ">>" or a ">>>", then it splits that token into two (or three) ">" separate tokens.

It also must deal with the opposite case: if we are NOT in a TypeParameters or TypeArguments construct AND we have two (or three) adjacent ">" tokens, we merge them into a single ">>" (or ">>>") token.

(N.B. The Token.split and Token.merge methods used in the above method were added to the Token.java boilerplate, since I anticipate them being generally useful to others.)

With the above TOKEN_HOOK method in place, the rest of the grammar that refers to these things can be written naturally. Thus, there is now no need for a separate PHONY lexical state. The RSIGNEDSHIFT and RUNSIGNEDSHIFT tokens are simply defined along with everything else in the default lexical state.

There also is no need for a separate ShiftOperator grammatical production. Thus, the ShiftExpression production is now just written in a completely natural way, as follows:

ShiftExpression :
  AdditiveExpression
  (
      ("<<" | ">>" | ">>>")
      AdditiveExpression
  )*
;

(N.B. The above production is written in the new streamlined syntax)

The Devil is in the Details

The current solution is still a bit crude and will probably be refined over the coming period by introducing certain much-needed constructs in JavaCC itself. At the moment, we are still dropping into Java code; ideally, the whole thing could be expressed in JavaCC itself.

However, the current solution is solid and correct in a way that the previous hack was not. Moreover, this improved solution is simply not feasible in legacy JavaCC (and was not feasible in JavaCC21 until very recently) for a couple of key reasons.

  1. Due to the way it is implemented, the legacy CommonTokenAction hook cannot be used to handle the above case.

  2. Legacy JavaCC offers no way to query whether we have entered a given grammatical production. That is what the isInProduction method in the above TOKEN_HOOK method does.

The key to understanding the first point above is in the first line of the code injection:

 INJECT PARSER_CLASS {...}

This TOKEN_HOOK method is being injected into the XXXParser class, NOT the XXXLexer class!

You see, the generated parser has a getNextToken method that looks something like:

private Token getNextToken() {
  Token result= tok.getNext();
  if (result== null) {
      result= token_source.getNextToken();
      tok.setNext(result);
  }
  tok.setNext(result);
  return result;
}

The real first-order impedance here is this: if the currentToken already has its next field set, the token_source.getNextToken() method is simply never called! So, you obviously can't solve this problem with a CommonTokenAction hook because it simply won't be invoked! At least, not reliably. You see, if a lookahead routine has already scanned ahead and tokenized ">>" as a single RSIGNEDSHIFT token, when it comes time for the parser machinery to pull the next token off the stream, as in the above method, it will return that token and your token hook method never gets the chance to do its thang: namely, to break it into two ">" tokens.

This is the key reason (or actually one of two key reasons, it turns out) why legacy JavaCC simply cannot solve this problem in a clean way. You would always need something like the rather grotesque 3-part hack that I originally came up with, and is outlined above.

This is a key concept, but also, properly understood, it could be considered to be part of a larger problem: there is simply no concept of rewinding the parsing/lexing machinery in legacy JavaCC.

You see, once you have context-dependent tokens and an arbitrary amount of lookahead, if the lookahead fails, you would very frequently need to rewind the token stream or somehow revisit the cached tokens, as we do here. In this specific case of Java generics, if you had a lookahead that scanned ahead to see if what follows is a certain kind of expression, it could identify ">>" as a single token. If the lookahead fails and now we are going to parse the same input as if it was a type declaration, it needs to have the smarts to break the ">>" into two separate ">" tokens.

But again, there is simply no point at which to do that! The token hook routine needs to be inserted in the appropriate place in the parser (NOT the lexer) code.

This is fixed in JavaCC21. In the above getNextToken routine, our TOKEN_HOOK routine (injected into the PARSER_CLASS!) gets invoked in the right spot. So it now looks like:

final private Token nextToken(Token tok) {
    Token result= tok.getNext();
    if (result== null) {  
        result= token_source.getNextToken();
        tok.setNext(result);
    }
    result= tokenHook$Java_javacc_line_42(result);
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    tok.setNext(result);
    return result;
}

In the general case, this is where the token hook method has to be inserted! At least if you want to use it to resolve this kind of problem.

(N.B. JavaCC21's TOKEN_HOOK feature is also significantly improved because you can define multiple TOKEN_HOOK methods and they will be inserted in the right place. The placeholder "TOKEN_HOOK" gets replaced with a unique method name. So you can define more than one! This is explained here.)

Note however, that being able to inject the token hook method in the right place is still not quite sufficient because... well... legacy JavaCC is simply too underpowered: it gives you no robust, standard way of knowing whether you are in a given production. That is what the call to isInProduction is about:

if (isInProduction("TypeParameters", "TypeArguments"))

The isInProduction method is leveraging the same machinery that is used by JavaCC21's lookbehind predicates. It works from within a regular parsing routine or a lookahead routine.

There are some various other points to be made about this, but this post may be reaching (or have already reached!) a point of diminishing returns. So, I'll try to wind it down here.

Concluding Remarks and Recap

It occurs to me that the TOKEN_HOOK method that I post above may still look pretty hairy to most readers. However, once you understand its logic, it should not be so intimidating.

The basic problem is that the input ">>" or ">>>" means something entirely different depending on the context in which it occurs. Once we have a syntactic lookahead routine that tries to scan this as the first case, we have a big problem if that routine returns false, because the way the code was structured (for the last 24 years or so) there was no way to inject a token hook (what was called CommonTokenAction in legacy JavaCC) to deal with the case. Or, in other words, the specific cases that CommonTokenAction should be able to handle could not possibly be handled that way!

Like... dude.... you can't get there from here!

The TOKEN_HOOK method above handles two basic cases.

  1. The parsing machinery has scanned ahead and identified the ">>" as a single token, but our token hook method needs to break it into two ">" tokens.

  2. It scanned ahead and identified the ">>" as two separate ">" tokens but it now realizes we are in an arithmetic expression where ">>" (or ">>>") means a right shift operator, so it has to merge the adjacent ">" tokens into a single token.

Of course, depending on the exact problem you are trying to solve, the solution could be something quite different. Actually, the more astute readers will realize already that the above discussion is not really specifically about solving this problem of the angle brackets in Java generics. That is simply one example of a much more general problem. Even if, at first blush, your problem is quite different from this one, if you are running into a wall because you need input to be scanned differently based on context, your first step should be to carefully go over the above example and make sure you understand it. Admittedly, there are some details that I have glossed over here. You can expect a later post (or two) soon to tie up those loose ends. And, of course, you are more than welcome to sign up on the Discourse forum and pose any questions or offer suggestions