Context-Sensitive Tokenization, Next Installment: Activating and Deactivating Tokens

Sometimes, when you complete a major code cleanup, features that were previously pie in the sky become low-hanging fruit to pluck. The new feature that I describe here, the ability to activate and deactivate tokens, is such a case. It resulted from my rewrite of the lexical code generation that I describe here.

In an earlier article I (rather proudly) outlined my solution at the time to a certain rather pesky little problem in Java parsing. In the following:

List<Set<Foo>> listOfFooSets;

we need to parse >> as two consecutive > tokens, while in other contexts >> does need to be identified as a single token. At first sight, it does not seem like this should be so hard to deal with, but it is surprisingly difficult to find a palatable solution. I was quite pleased with the solution I describe there because it was definitely far better than the 3-part kludge I had been using before, and better still than the 5-part kludge that the legacy code used. However, the truth is that it was still a kludge! But now that we can activate and deactivate tokens on an as-needed basis, there is a far more elegant solution.
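
To see the two contexts side by side, here is a small, purely illustrative snippet of plain Java in which the very same character sequence >> has to be tokenized both ways:

import java.util.List;
import java.util.Set;

class ShiftVersusGenerics {
    List<Set<String>> listOfStringSets;  // ">>" must be two ">" tokens here

    int halve(int n) {
        return n >> 1;                   // ...and a single shift operator here
    }
}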

Well, the easiest way to explain the new feature is with an actual code example. Here is how the ShiftExpression construct is implemented in the current Java grammar:

ShiftExpression :
    AdditiveExpression
    ACTIVATE_TOKENS RSIGNEDSHIFT, RUNSIGNEDSHIFT
    (
        ( "<<" | ">>" | ">>>" )
        AdditiveExpression
    )*
;

The use of ACTIVATE_TOKENS is fairly straightforward: the two named tokens are activated at the beginning of the expansion that follows, delimited by the parentheses, and at the end of that expansion, the set of active tokens is reset to what it was before.
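
In terms of generated code, those semantics amount to something like the following sketch. (To be clear, the field and helper names here are merely illustrative, not the actual generated API.)

import java.util.EnumSet;

class ActivationSketch {
    enum TokenType { RSIGNEDSHIFT, RUNSIGNEDSHIFT /* ... */ }

    // The set of token types that the lexer currently recognizes.
    EnumSet<TokenType> activeTokenTypes = EnumSet.noneOf(TokenType.class);

    void parseShiftExpression() {
        parseAdditiveExpression();
        // ACTIVATE_TOKENS RSIGNEDSHIFT, RUNSIGNEDSHIFT ( ... )
        EnumSet<TokenType> saved = EnumSet.copyOf(activeTokenTypes);
        activeTokenTypes.add(TokenType.RSIGNEDSHIFT);
        activeTokenTypes.add(TokenType.RUNSIGNEDSHIFT);
        parseShiftTail();          // the parenthesized expansion
        activeTokenTypes = saved;  // reset to what it was before
    }

    void parseAdditiveExpression() { /* elided */ }
    void parseShiftTail() { /* elided */ }
}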

At the very top of the grammar file, we have the option:

DEACTIVATE_TOKENS=RSIGNEDSHIFT, RUNSIGNEDSHIFT, RECORD;

These token types are turned off by default and turned on at key moments. So RecordDeclaration, the new stable feature in JDK 16, is defined as follows:

RecordDeclaration :
    Modifiers(EnumSet.of(PUBLIC, PROTECTED, PRIVATE, ABSTRACT, FINAL, STATIC, STRICTFP))
    ACTIVATE_TOKENS RECORD ("record")
    =>||
    [TypeParameters]
    RecordHeader
    [ImplementsList]
    RecordBody
;

The token for the "soft keyword" record is activated at the key point where we need it; everywhere else, "record" is simply tokenized as an identifier.
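
For instance, both of the following are perfectly legal Java, and the word record has to be tokenized differently in each spot:

// "record" is only a keyword where a record declaration can begin;
// everywhere else it is an ordinary identifier.
class SoftKeywordDemo {
    int record = 17;                // "record" tokenized as an identifier

    record Point(int x, int y) {}   // "record" tokenized as the RECORD token
}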

Well, that's it. Of course, I'm pretty sure that this mechanism is quite generally useful, not just for this specific problem of parsing Java generics -- even if that is the example I keep coming back to when discussing the overall problem of context-sensitive tokenization.

I anticipate that this feature will significantly reduce the need to define separate lexical states since, in many cases, all we really want is to turn a single token on or off in a given spot. Defining a separate lexical state for that is a rather heavyweight, inefficient solution. Granted, sometimes a separate lexical state is the natural solution, such as when embedding JavaScript inside HTML, but it never seemed right to me to have separate lexical states that differ by only a token or two...

Notable Replies

  1. How does one deal with activation/deactivation when there might be recursion involved? For example, the Python 3.10 grammar has the soft keywords match and case, and of course case statements can contain nested match statements. Would it be sufficient to do something like:

    CompoundStatement :
        FunctionDefinition
        |
        (other alternatives elided)
        |
        TryStatement
        |
        ACTIVATE_TOKENS(MATCH)
        MatchStatement
        DEACTIVATE_TOKENS(MATCH)
    ;

    MatchStatement :
        <MATCH> SubjectExpression <COLON> <NEWLINE> <INDENT>
        ACTIVATE_TOKENS(CASE) (CaseBlock)+ DEACTIVATE_TOKENS(CASE) <DEDENT>
    ;

    CaseBlock :
        <CASE> Patterns [ Guard ] <COLON> Block
    ;

    Or is there a better way?

  2. Hmmm. I think there is a problem here, and the code above won’t work correctly, but it doesn’t really have anything to do with recursion. I mean, if you know that the soft keyword “match” should be tokenized as a MATCH keyword in that specific spot and treated as a regular IDENTIFIER everywhere else, it doesn’t really matter how deeply nested you are at that juncture. The problem lies elsewhere.

    The above doesn’t work (at least for now, but I’ll be working on the problem!) due to how the default choice resolution algorithm works. In the absence of anything to the contrary, the generated code looks ahead one token and decides whether to enter a given expansion on that basis, right?

    So, if you have a choice, like:

       IfStatement
       |
       WhileStatement
       |
       MatchStatement
       |
       ReturnStatement
    

    the assumption (given that we haven’t put any explicit lookahead anywhere) is that we can decide which choice to enter based on the next token. If the next token is an IF, we go into the IfStatement, if it’s a RETURN, we go into the ReturnStatement, and so on. And you can imagine, roughly, what code is generated for this. Something like:
    if (nextTokenType == IF) {…}
    else if (nextTokenType == MATCH) {…}

    etc.

    Well, the problem (as things stand) is that the decision to take a given choice is effectively made before the ACTIVATE_TOKENS is hit. If we decide whether to go into the MatchStatement option based on the next token, we’ll never do it: the next token, even if it is the string “match”, has been matched as an IDENTIFIER, and on that basis we decide not to enter the choice, so the ACTIVATE_TOKENS(MATCH) is never invoked! In other words, the choice that involves activating the MATCH token never gets taken precisely because the next token has already been scanned as an IDENTIFIER and we don’t enter that choice in the first place. You see the problem?
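
    Concretely, the generated choice code amounts to something like the following sketch (the names are made up, but the logic is the point):

       if (nextTokenType == IF) {
           parseIfStatement();
       } else if (nextTokenType == MATCH) {
           // Never reached: with MATCH deactivated, the string "match" was
           // scanned as an IDENTIFIER, so the one branch that would have
           // led to ACTIVATE_TOKENS(MATCH) is effectively dead code.
           parseMatchStatement();
       } else if (nextTokenType == RETURN) {
           parseReturnStatement();
       }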

    I haven’t actually checked, but I’m about 99% certain that the code you wrote doesn’t work (which is NOT a reproach, mind you. It probably should work…)

    Now, I think it would work if you wrote:

            SCAN 0 {getToken(1).getImage().equals("match")}  =>
            ACTIVATE_TOKENS(MATCH)
            MatchStatement
    

    We add that extra semantic lookahead so that if the next token is the string “match” (regardless of its type, so it can be an identifier) we enter the following expansion; we then turn on the token type MATCH and enter MatchStatement. And “match” here is a MATCH token because the ACTIVATE_TOKENS caused us to re-tokenize the tokens that follow! (And the DEACTIVATE_TOKENS should probably just occur right after the <MATCH>.)
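
    Roughly speaking, that SCAN amounts to the following in the generated parser (again, the helper names are just for illustration):

       // Compare the raw image of the next token, not its type, so the test
       // succeeds even while "match" is still being scanned as an IDENTIFIER.
       if (getToken(1).getImage().equals("match")) {
           activateTokens(MATCH);   // re-tokenize: "match" now comes back as MATCH
           parseMatchStatement();   // consumes <MATCH>; the DEACTIVATE would then
       }                            // happen right after that token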

    I’m pretty sure that works, but I don’t consider it satisfactory. It requires people to be entirely too clever, and I don’t want to market the tool as being solely for super clever people!

    Oh, another way to get this to work would probably be:

       MatchStatement :
           ACTIVATE_TOKENS(MATCH) <MATCH> =>||
           DEACTIVATE_TOKENS(MATCH) 
           etc...
    

    The reason that would work is that, since we put an up-to-here marker in the MatchStatement, it generates a lookahead routine, so we’re not doing the simple one-token-lookahead default, and we get around the catch-22 problem I described initially. But that still requires people to be clever…

    As for the CASE, I don’t think there is much problem with it, since it’s not occurring at a choice point. That said, I think it would be better to put the activate/deactivate in the CaseBlock production:

       CaseBlock :
           ACTIVATE_TOKENS(CASE) <CASE> DEACTIVATE_TOKENS(CASE)
    

    which is a bit verbose, but should work. And it works because the CaseBlock is not at a choice point, so we’re not hitting this catch-22 problem.

    I’ll close this note here, but it’s not a full answer, because, while writing this, I realized that there are some rather screwy aspects to all this that probably need to be revisited, so stay tuned…

  3. D’oh! Of course, should’ve seen that :blush:

    On what basis would the MatchStatement production be entered in this case from CompoundStatement?

  4. Well… it’s not so easy to reason about these things. I myself find that I still make similar sorts of mistakes. And that’s at a point where I know the whole system more intimately than anybody! So, for somebody in your position, just coming into this…

    But the other thing is that this conversation makes me realize that this feature is implemented incorrectly. I think the ACTIVATE/DEACTIVATE stuff has to apply to an expansion as a whole and be implemented in code with a try…finally block. It should really be something more like:

       ACTIVATE FOO, BAR
       (some Expansion)
    

    And that should generate something like:

       activateTokens(FOO, BAR);
       try {
           // expansion code
       }
       finally {
           deactivateTokens(FOO, BAR);
       }
    

    Or, if it was DEACTIVATE instead of ACTIVATE, it’s the other way round. We deactivate at the start and re-activate the same tokens in the finally block.
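
    Spelled out, the mirror case would generate something like:

       deactivateTokens(FOO, BAR);
       try {
           // expansion code
       }
       finally {
           activateTokens(FOO, BAR);  // re-activate the same tokens
       }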

    You see, the problem is that, as things stand, it just doesn’t quite work correctly in terms of a lookahead routine. A lookahead routine jumps out once it realizes that it has failed, but generally speaking, if it activated or deactivated a given token, it should restore the state of the world to what it was before. Something like:

      SCAN MatchStatement =>
    

    should not have the side effect that, even when the scanahead fails, the MATCH token remains activated afterwards. The only way for this to work robustly is for the ACTIVATE/DEACTIVATE to sit inside a try-finally construct. It’s just implemented wrong, basically.
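
    So a robustly generated lookahead routine would look something like this sketch (the names, again, are made up):

       private boolean scanMatchStatement() {
           activateTokens(MATCH);
           try {
               // the scanahead can bail out at any point...
               if (!scanToken(MATCH)) return false;
               if (!scanSubjectExpression()) return false;
               // ...and so on through the rest of the production
               return true;
           } finally {
               deactivateTokens(MATCH);  // ...but the activation state is always restored
           }
       }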

    Well, there’s the other aspect, which is that, in terms of LL(1)-style logic, there needs to be a recognition that if an expansion starts with an ACTIVATE_TOKENS or DEACTIVATE_TOKENS, then even if the expansion or production requires only one token of lookahead, we still generate a custom lookahead routine for it. That would get rid of the gotcha we were discussing, and it is not too hard to do. I initially thought that was the full solution, but then I started thinking and realized that there is a deeper problem with all this. But I figure it’s still early enough that I can re-implement this feature correctly!

  5. I’ll hold off looking at any match-statement related stuff until you’ve done the re-implementation.

Continue the discussion at parsers.org
