JavaCC 21 now has a Preprocessor!

It had been nagging at the back of my mind for some time that we would eventually need some sort of preprocessor functionality, to support output for different programming languages (besides Java) and, actually, even different Java targets. I had also been looking more closely at C# recently and, finally, I just decided to copy the way the C# preprocessor works, which is much more limited (and sane) than the full C/C++ preprocessor.

In fact, this JavaCC 21 preprocessor only implements the part of the C# preprocessor that deals with turning on and off regions of the code based on some conditions (conditional symbols) that you define. So, basically, you can write things like:

 #define testing

  #if foo

   #undef testing

   #define stable

  #endif

  #if !(testing || debug)

     something 

  #elif stable

    And now something else!

  #else

     And now for something completely different!

  #endif

Well, as you probably see already, the preprocessor is its own separate little mini-grammar. (It is expressed in a couple of hundred lines here).

So it has its own rules (that I did not invent) like: all of these pre-processing directives that start with # must be on their own line. In the above, the #if/#elif/#else/#endif structure has to be valid. If a closing #endif were missing, for example, the preprocessor would complain. Same old... same old...

Note, however, that these constructs are not really part of the syntactic or lexical grammar of a JavaCC grammar file. All of the preprocessing is best thought of as pre-lexical. The way it works is that the preprocessor runs over the source file and simply builds up the information (in a BitSet instance) that marks which lines in the source file are turned on and off. Then, when the lexical machinery reads in the code to be lexed (and parsed), the lines that are marked as ignored are simply skipped. Neither the parser nor the lexical machinery sees any of those ignored lines; both behave as if the lines weren't there. Well, with one key difference: the line number information stored in Tokens and Nodes is correct based on the location in the original file. So if you have:

 1. #if false
 2. blah blah blah
 3. #endif
 4. Foobar : "foo" "bar";

The Foobar production and the tokens inside it know that they are on line 4, not on line 1 as they would be if we really stripped out the first three lines and fed the remaining code to the parser.
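To make the BitSet idea concrete, here is a minimal sketch in plain Java. It is not the JavaCC 21 implementation: the class and method names are made up for illustration, conditions are simplified to a single symbol (or true/false) with an optional leading !, and it assumes that the directive lines themselves are marked as skipped along with the disabled regions. Because lines are only marked and never removed, the surviving lines keep their original line numbers.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the JavaCC 21 code: marks which lines are switched off.
class LineMarkerSketch {

    // Simplification: "true", "false" or one symbol, optionally prefixed by "!".
    // No ||, && or parentheses are handled here.
    static boolean eval(String cond, Set<String> symbols) {
        cond = cond.trim();
        if (cond.startsWith("!")) return !eval(cond.substring(1), symbols);
        if (cond.equals("true")) return true;
        if (cond.equals("false")) return false;
        return symbols.contains(cond);
    }

    static boolean[] top(Deque<boolean[]> open) {
        if (open.isEmpty()) throw new IllegalStateException("directive without matching #if");
        return open.peek();
    }

    // Bit i is set when line i + 1 of the source must be skipped by the lexer.
    static BitSet ignoredLines(String source) {
        String[] lines = source.split("\n", -1);
        BitSet ignored = new BitSet();
        Set<String> symbols = new HashSet<>();
        // One frame per open #if: {enclosing region active, some branch already taken}
        Deque<boolean[]> open = new ArrayDeque<>();
        boolean active = true;
        for (int i = 0; i < lines.length; i++) {
            String[] parts = lines[i].trim().split("\\s+", 2);
            String arg = parts.length > 1 ? parts[1] : "";
            boolean directive = true;
            switch (parts[0]) {
                case "#define": if (active) symbols.add(arg); break;
                case "#undef": if (active) symbols.remove(arg); break;
                case "#if": {
                    boolean on = active && eval(arg, symbols);
                    open.push(new boolean[] {active, on});
                    active = on;
                    break;
                }
                case "#elif": {
                    boolean[] f = top(open);
                    active = f[0] && !f[1] && eval(arg, symbols);
                    f[1] |= active;
                    break;
                }
                case "#else": {
                    boolean[] f = top(open);
                    active = f[0] && !f[1];
                    f[1] = true;
                    break;
                }
                case "#endif": active = top(open)[0]; open.pop(); break;
                default: directive = false; // an ordinary line of the grammar
            }
            if (directive || !active) ignored.set(i);
        }
        if (!open.isEmpty()) throw new IllegalStateException("missing #endif");
        return ignored;
    }

    public static void main(String[] args) {
        String src = "#if false\nblah blah blah\n#endif\nFoobar : \"foo\" \"bar\";";
        String[] lines = src.split("\n", -1);
        BitSet ignored = ignoredLines(src);
        // Walk only the lines that are still switched on, keeping original numbering.
        for (int i = ignored.nextClearBit(0); i < lines.length; i = ignored.nextClearBit(i + 1)) {
            System.out.println((i + 1) + ": " + lines[i]); // prints 4: Foobar : "foo" "bar";
        }
    }
}
```

A consumer that reads the source through this BitSet never sees lines 1 to 3, yet still reports the Foobar production on line 4.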

Can I use it?

Some readers may already be wondering whether this is re-usable in their own projects. And the answer is that it basically is. You can see here the key point where the JavaCC grammar uses the Preprocessor grammar to get the BitSet of line markers that turn off the various line ranges.

In fact, I think this is generally useful enough that it will eventually just be a settings toggle, something like USE_PREPROCESSOR=true, and your DSL will automatically have this preprocessor functionality. That is not implemented yet, but it is already not very hard to incorporate this into any other project.

Internationalization... Not

At the moment, all the conditional symbols have to be in 7-bit ASCII. The regexp for the conditional symbols currently looks like this:

<PP_SYMBOL : (["_", "a"-"z", "A"-"Z"])(["_", "a"-"z", "A"-"Z", "0"-"9"])*>
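For readers more at home with java.util.regex than with JavaCC token definitions, the rule above should correspond to the pattern [_a-zA-Z][_a-zA-Z0-9]*. The following sketch (the class name is invented for illustration) shows which candidate symbols that pattern accepts; the non-ASCII names are written as Unicode escapes only so that the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: PP_SYMBOL transliterated into java.util.regex syntax.
class AsciiSymbolSketch {
    static final Pattern PP_SYMBOL = Pattern.compile("[_a-zA-Z][_a-zA-Z0-9]*");

    // True only when the whole string is a valid 7-bit ASCII symbol name.
    static boolean isSymbol(String s) {
        return PP_SYMBOL.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {
            "testing",
            "_debug2",
            "2fast",                                      // starts with a digit
            "\u4F60\u597D",                               // the Chinese greeting "ni hao"
            "\u043E\u0442\u043B\u0430\u0434\u043A\u0430"  // the Russian word for "debugging"
        };
        for (String s : candidates) {
            System.out.println(s + " -> " + isSymbol(s));
        }
    }
}
```

Only the first two candidates are accepted, which is exactly the limitation discussed here.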

It would have been easy enough to allow people to have conditional symbols in full Unicode, so as to write things like:

 #define 你好

Or:

 #if отладка
  ....
 #endif

However, there is really still no clean way of including the whole Unicode definition of an identifier for something like this. The whole preprocessor grammar is only a couple of hundred lines and it just felt weird to copy-paste an identifier definition that is longer than that.

I like the idea of treating non-English speakers as full citizens, of course, but I need a cleaner way of reusing the various internationalized definitions of Identifiers and such that use the full Unicode character set. I anticipate that when I have that in place, this will be one of the first places I apply it.

So, well, aside from not currently supporting full Unicode in the conditional symbol names, we also only have the #define, #undef and #if/#elif/#else/#endif directives. The various other directives (mostly just used internally in C#) such as #pragma, #line, #region/#endregion and some others are all just ignored at the moment.

By the way, a directive that does not even exist in C# is just passed through. So the line:

#foobar blah blah

is just passed through to JavaCC since there is no #foobar instruction in the C# preprocessor. On the other hand, the line:

 #warning This is a warning!

is recognized as a C# directive and simply ignored: JavaCC 21 does nothing with it. At least for now...
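The three cases described above (directives that are acted on, C# directives that are ignored, and unknown # lines that are passed through) can be summarized in a small classification sketch. The names here are invented for illustration, and the ignored set lists only the directives this article mentions by name, not the "some others".

```java
import java.util.Set;

// Hypothetical sketch of how a line's leading directive decides its fate.
class DirectiveKindSketch {
    enum Kind { HANDLED, IGNORED, PASSED_THROUGH }

    // Directives the preprocessor actually acts on.
    static final Set<String> HANDLED_DIRECTIVES =
        Set.of("#define", "#undef", "#if", "#elif", "#else", "#endif");

    // C# directives that are recognized but dropped (only those named in the text).
    static final Set<String> IGNORED_DIRECTIVES =
        Set.of("#pragma", "#line", "#region", "#endregion", "#warning");

    static Kind classify(String line) {
        // The directive is the first whitespace-delimited word on the line.
        String first = line.trim().split("\\s+", 2)[0];
        if (HANDLED_DIRECTIVES.contains(first)) return Kind.HANDLED;
        if (IGNORED_DIRECTIVES.contains(first)) return Kind.IGNORED;
        return Kind.PASSED_THROUGH; // ordinary text, or a # line C# does not define
    }

    public static void main(String[] args) {
        System.out.println(classify("#foobar blah blah"));           // prints PASSED_THROUGH
        System.out.println(classify(" #warning This is a warning!")); // prints IGNORED
        System.out.println(classify("  #if foo"));                    // prints HANDLED
    }
}
```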

ATTEMPT/RECOVER is back

In other matters, I put back the ATTEMPT/RECOVER construct. It should work but is largely untested. The syntax is slightly changed. You write:

  ATTEMPT Foo Bar 
  RECOVER {some Java code...} 

Or:

  ATTEMPT Foo Bar 
  RECOVER (Baz Bat)

So, if you use curly braces, what is inside is Java code; if you use parentheses, it is a JavaCC grammar expansion. Note that you can put a Java code block in any grammar expansion anyway, so you can write:

  ATTEMPT Foo Bar
  RECOVER ({some java code} Baz {more java code} Bat {even more java code})

The above construct will parse (well, attempt to match) the expansion, in this case Foo Bar, and then if a ParseException is thrown it tries to recover with the code after RECOVER. BUT... only after rolling back the state of the world to before it entered Foo!

That is the key difference between ATTEMPT/RECOVER and the older try-catch.
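The rollback behaviour can be illustrated with a toy hand-written parser. This is only a sketch of the idea, not the code JavaCC 21 generates: the ParseException class, the attempt method and the "matched" list (standing in for whatever parser state has to be restored) are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of ATTEMPT/RECOVER semantics on a toy token list.
class AttemptRecoverSketch {
    static class ParseException extends RuntimeException {
        ParseException(String msg) { super(msg); }
    }

    final List<String> tokens;
    int pos = 0;                                    // index of the next token to consume
    final List<String> matched = new ArrayList<>(); // stand-in for tree-building state

    AttemptRecoverSketch(List<String> tokens) { this.tokens = tokens; }

    // Consumes the next token if it is the expected one, otherwise fails.
    void expect(String expected) {
        if (pos >= tokens.size() || !tokens.get(pos).equals(expected)) {
            throw new ParseException("expected " + expected + " at token " + pos);
        }
        matched.add(tokens.get(pos++));
    }

    // Runs the attempt; on failure restores the saved state, then runs the recovery.
    void attempt(Runnable attempt, Runnable recovery) {
        int savedPos = pos;
        int savedSize = matched.size();
        try {
            attempt.run();
        } catch (ParseException e) {
            pos = savedPos;                                     // rewind the token stream
            matched.subList(savedSize, matched.size()).clear(); // discard partial results
            recovery.run();
        }
    }

    public static void main(String[] args) {
        AttemptRecoverSketch p = new AttemptRecoverSketch(List.of("foo", "baz", "bat"));
        // "foo" is consumed before "bar" fails, so the recovery only works
        // because the parser is rewound to the point before "foo".
        p.attempt(
            () -> { p.expect("foo"); p.expect("bar"); },
            () -> { p.expect("foo"); p.expect("baz"); p.expect("bat"); });
        System.out.println(p.matched); // prints [foo, baz, bat]
    }
}
```

With a plain try-catch and no rollback, the recovery code in this toy would start at the second token, with the first "foo" already recorded, and would fail.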

Well, there are some other new features that are implemented, but I'll have to document them in a later article.

Notable Replies

  1. Internationalization… Yes!

    I tweaked things so that the preprocessor symbols can be anything that Java recognizes as an identifier, so you can write:

      #define 中文
    
      #if 中文
       ...
      #endif
    

    Etcetera. I just broke out a separate include in the Java grammar that only contains the big definition of a Java identifier, and simply reused it via include in the Preprocessor grammar.

    Well, see here and here if you’re interested.

  2. Since it was such low-hanging fruit, I just added a feature that you can pass in preprocessor symbols on the command line when invoking JavaCC. So you can write something like:

      javacc -p strict_syntax,debug MyGrammar.javacc
    

    and the symbols “strict_syntax” and “debug” will be defined, so you can do:

     #if strict_syntax
         ....
     #else
         something not so strict presumably
     #endif
    

    Those symbols start off being defined, but you can also later “un-define” them, as in:

     #undef debug
    

    I really think that for any language/parsing/grammar sort of project beyond a moderate level of complexity, this sort of preprocessing ability will prove itself to be indispensable. People will be wondering how they did without it. And, of course, it should work nicely in conjunction with INCLUDEs.

      #if strict_syntax
    
            INCLUDE ("StrictlyDefinedSyntax.javacc")
    
      #else
    
           INCLUDE ("LooselyDefinedSyntax.javacc")
    
      #endif
    

    Well, a large proportion of new features like this are implemented because I anticipate using them myself in internal development!
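The two replies above can be sketched together in plain Java: splitting the comma-separated value given after -p into an initial set of defined symbols, and checking each one against what Java accepts in an identifier. This is an illustration only, with invented names. It relies on Character.isJavaIdentifierStart and Character.isJavaIdentifierPart, so it accepts "$" and does not reject reserved words, which may differ from what the actual grammar does.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: initial preprocessor symbols taken from a -p style argument.
class InitialSymbolsSketch {

    // True when s is non-empty and every code point is legal in a Java identifier.
    static boolean isIdentifier(String s) {
        if (s.isEmpty()) return false;
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) return false;
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) return false;
            i += Character.charCount(cp);
        }
        return true;
    }

    // Splits a value such as "strict_syntax,debug" into a set of defined symbols.
    static Set<String> parseSymbols(String value) {
        Set<String> symbols = new LinkedHashSet<>();
        for (String s : value.split(",")) {
            s = s.trim();
            if (!isIdentifier(s)) {
                throw new IllegalArgumentException("not a valid symbol: " + s);
            }
            symbols.add(s);
        }
        return symbols;
    }

    public static void main(String[] args) {
        Set<String> symbols = parseSymbols("strict_syntax,debug");
        symbols.remove("debug");     // what a later "#undef debug" amounts to
        System.out.println(symbols); // prints [strict_syntax]
    }
}
```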
