Some Niggling Whitespace Issues

A very annoying detail in computing (annoying in particular because the whole thing seems so utterly pointless) is this whole issue of line endings in text files. In Microsoft Windows, the operating system that dominates the desktop, a line ending is denoted by a "carriage return" followed by a "line feed", or CR-LF for short (namely, the two characters \u000D followed by \u000A). However, on Unix systems (and I include Linux when I say this), which are what most servers run, a newline is just the line feed, i.e. \n. That is the case for the Mac as well, or has been for the last couple of decades. Before that, Macs used the carriage return alone with no line feed -- a lone CR.

Not so long ago, I was trying to think back to what the origin of this whole thing is. Now, here is a little bit of personal background. I am an extremely proficient touch typist. I type at something like 150 wpm. And that is really quite strange, given how clumsy a person I am generally. However, a big factor in proficiency is being exposed to something at a very young age. I was trying to trace back to when I first learned touch typing and it must have been when I was about eight years old. The primary school I attended had a room with a few old manual typewriters that must have been donated at some point. There were also some (surely also quite old) self-teaching books that explained the basics of touch typing -- which really amounts mostly to keeping your fingers as close to the "home keys" as possible. But somehow this whole thing exerted a kind of fascination on me back then and I was always eager to practice my typing in the typing room. This was surely a great thing from the point of view of the teachers, who could get this rather unruly kid out of the way for a while. In retrospect, I think that, for the teachers, the typing room was just a place to keep some kids occupied doing something.

While I don't remember the exact model of the old typewriter I first used, it must have been something quite similar to the one pictured above, manufactured by the Underwood Company of New York at some point in the 1940s or thereabouts. This was already quite an old machine when I was exposed to it in the 1970s, but not yet a museum piece, as it is now! I happened on that photo while searching the internet, and it looks vaguely familiar.

Now, obviously, those old typewriters did not have any sort of automatic word-wrapping. When you reached the point where it would make sense to start a new line, you would do a "carriage return" by taking the lever on the left and forcefully sliding it to the right. However, you would also need to do a line feed, using the roller on the right side to advance the paper by the necessary amount (which depended on whether you wanted your text to be single- or double-spaced). If you did a carriage return with no corresponding line feed, you would be typing over the last line you typed, which would not usually be one's intention. (Well, if you had whited out the line you just typed and now wanted to type something different over that, then it would be your intention, but normally not. And, in any case, I don't think this really has any analogue in terms of typing on a computer in the current day.)

So, the thing is that the carriage return followed by the line feed would eventually be a combined movement that entered one's muscle memory. I have to think this entered the computing world because the first teletype terminals surely conserved many of the old conventions that came from these old typing machines. So that is why the carriage return and line feed would have been separate "control characters" and starting a new line would be the CR followed by the LF. This is where all this came from surely, no? And, as for the rift between CR-LF and the lone LF, that would have come about because some geniuses finally figured out that the CR-LF convention was actually now superfluous. Besides, unlike now, memory was very expensive back then, so using a single byte to end a line instead of two bytes was likely a big deal.

And what about those tab characters?

The old manual typewriters had this mechanism of tab stops. You pressed a button and this would (via some sort of spring-loaded mechanism, I guess) slide the carriage to the right, to the nearest "tab stop". This was kind of useful, I suppose, if you wanted to line up columns. I assume (though I don't recall using it very much) that these old typewriters had some mechanism to adjust the points at which the sliding mechanism would stop. And this is, as best I can tell, the origin of the TAB character (\u0009) as a control character to emulate the behavior of these old manual typewriters. Also, using the TAB character instead of multiple spaces would entail a savings in memory usage -- not any real consideration in the current day world, but a big deal back then.

So there you go. These things originate in conventions that come from the operation of these very old mechanical machines, but really have no particular reason to exist nowadays. Well, to be clear, in serious typography there is bound to be a need for different characters to designate different kinds of whitespace, like a non-breaking space, or some kind of half-width or one-and-a-half width space or whatever. However, the problem with TABs is that they are entirely implementation defined. There is no general agreement on how much space a TAB character represents. I guess the most common convention is to assume that the (now imaginary) TAB stops are at 8-space intervals, i.e. at columns 1, 9, 17, 25... So if you are at column 12, let's say, and there is a TAB, then the next character after that is at column 17, so that TAB character is effectively 5 spaces. But when people use TABs to indent source code, it is maybe more common to assume that the tab stops are at 4-space intervals.
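The tab stop arithmetic above can be written down in a couple of lines. What follows is only a sketch for illustration: TabStops and nextColumn are hypothetical names, not anything taken from JavaCC 21, and columns are assumed to be 1-based with the first tab stop at column 1.

```java
// Hypothetical helper for illustration only; not part of any JavaCC 21 API.
final class TabStops {
    // Columns are 1-based. Tab stops are assumed to sit at columns
    // 1, 1 + tabSize, 1 + 2 * tabSize, and so on.
    // Returns the column of the character that follows a TAB typed at `column`.
    static int nextColumn(int column, int tabSize) {
        return column + tabSize - ((column - 1) % tabSize);
    }

    public static void main(String[] args) {
        System.out.println(nextColumn(12, 8)); // prints 17, so the TAB counts as 5 spaces
        System.out.println(nextColumn(12, 4)); // prints 13, so the TAB counts as 1 space
    }
}
```

The same TAB at column 12 is worth 5 spaces under one convention and 1 space under the other, which is exactly the ambiguity described above.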

All that said, the main point where this becomes an issue is when it comes to reporting error locations in the input, since, if there are tabs on the line, the column at which the error occurs will be different depending on how many spaces you think these tabs represent. (And again, that is essentially arbitrary!)

Anyway, here is the current situation with parsers generated by JavaCC 21. First of all, the default behaviors:

If you do not specify otherwise, line endings are "normalized" to a lone line feed internally. So, if a line ends with a CR-LF (or a lone CR, though that is very rare in practice nowadays), it is converted to a lone LF.

If you actually want CR-LF in your internal buffer, you need to specify:

  PRESERVE_LINE_ENDINGS=true;

at the top of your grammar. In that case, it will preserve all the line endings as they were in the source file. If the input file is a total mess, with some lines ending with CR-LF and others with just LF, it respects how it was in the original file. But without setting PRESERVE_LINE_ENDINGS, it would just normalize all the line endings to a lone LF.
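To make the default normalization concrete, here is a minimal sketch of what converting CR-LF and lone CR to a lone LF amounts to. This is not the code that JavaCC 21 actually generates; LineEndings and normalize are hypothetical names chosen for the example.

```java
// Hypothetical helper for illustration only; not the actual JavaCC 21 implementation.
final class LineEndings {
    // Converts every CR-LF pair and every lone CR to a single LF.
    // Existing lone LF characters pass through unchanged.
    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\r') {
                out.append('\n');
                // If this CR starts a CR-LF pair, skip the LF so that
                // the pair produces only one newline.
                if (i + 1 < input.length() && input.charAt(i + 1) == '\n') {
                    i++;
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this sketch, a messy input such as "a\r\nb\rc\nd" comes out as "a\nb\nc\nd", whereas with PRESERVE_LINE_ENDINGS=true the input would be left as it was.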

Now, when it comes to TAB characters, if you specify nothing, the TAB characters will be in your internal buffer, and column locations will not be adjusted in any special way. Each TAB character is treated as a single horizontal offset, as if it were a single space. This is not usually what you want, but the logic behind this is that, since the system has no way of reading your mind, it just treats the TAB character like any other single character for error reporting purposes.

Now, if you specify:

 TAB_SIZE=n;

where obviously n is an integer, typically 8 or 4 or something like that, then the system will convert the tabs to spaces, based on the idea that the tab stops are n spaces apart. But note that it internally converts to spaces, so you have no actual tab characters in your input, just spaces -- basically the same as your having no CR (\u000D) characters in your input unless you set:

 PRESERVE_LINE_ENDINGS=true;

at the top of your grammar file.
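As an illustration of what converting tabs to spaces on the basis of tab stops means, here is a small sketch. It is not the generated code itself; TabExpander and expand are hypothetical names, and the sketch assumes the line endings have already been normalized to a lone LF.

```java
// Hypothetical helper for illustration only; not the actual JavaCC 21 implementation.
final class TabExpander {
    // Replaces each TAB with enough spaces to reach the next tab stop,
    // where tab stops are tabSize characters apart.
    // Assumes lines are terminated by a lone LF.
    static String expand(String input, int tabSize) {
        StringBuilder out = new StringBuilder(input.length());
        int col = 0; // 0-based offset within the current line
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\t') {
                int spaces = tabSize - (col % tabSize);
                for (int j = 0; j < spaces; j++) {
                    out.append(' ');
                }
                col += spaces;
            } else {
                out.append(c);
                // A newline resets the offset; anything else advances it by one.
                col = (c == '\n') ? 0 : col + 1;
            }
        }
        return out.toString();
    }
}
```

For example, expand("ab\tc", 4) gives "ab" followed by two spaces and then "c", since the TAB only needs to cover the distance from offset 2 to the tab stop at offset 4.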

So, by the same token, if you really love these tab characters or you have some weird grammar where they actually are semantically meaningful so you have to conserve them, then you have the option:

 PRESERVE_TABS=true;

which means that it does not convert the tabs to spaces. So, the above is the current state of affairs as regards these whitespace issues. Well, at least until I decide on something different and change it, but I really don't think I will. I think it will almost certainly stay like this.

Switching to unchecked exceptions (by default)

Speaking of changing defaults, I recently (very recently, just a few hours ago, as of this writing) changed the default code generation for the ParseException class and now, by default, ParseException is not a checked exception, i.e. it subclasses java.lang.RuntimeException. The main reason for this is that actual praxis seems to show that the whole checked exception concept is pretty overrated really. Granted, there are some arguments in its favor, but on balance, it's mostly a PITA. You see, as a practical matter, there is usually very little useful that most calling code can do when it catches an exception, so it's almost always better just to let the exception bubble up -- eventually to some sort of default handler that deals with the problem, elegantly or not, usually depending on whether the system is in production or at a debug stage. (In any case, you can explicitly catch any exception, whether it is checked or unchecked.)

However, if you disagree with the above and really do like checked exceptions, there is a new setting that you can use, like so:

 USE_CHECKED_EXCEPTION=true;

And then it generates code where ParseException subclasses Exception and also, as a side effect of that, all the various methods that potentially throw ParseException declare throws ParseException in their signature. (In other words, it works like it did before this change.)
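The practical difference for calling code can be sketched as follows. The two exception classes here are stand-ins invented for this example (the generated class is simply called ParseException in either mode), and the parse methods are hypothetical placeholders for generated parser methods.

```java
// Illustration only. All names below are hypothetical stand-ins,
// not classes or methods generated by JavaCC 21.
final class ExceptionStyles {
    // Stand-in for the default: ParseException as an unchecked exception.
    static class UncheckedParseException extends RuntimeException {
        UncheckedParseException(String message) { super(message); }
    }

    // Stand-in for USE_CHECKED_EXCEPTION=true: ParseException as a checked exception.
    static class CheckedParseException extends Exception {
        CheckedParseException(String message) { super(message); }
    }

    // Unchecked: no throws clause is needed in the signature.
    static void parseUnchecked(String input) {
        if (input.isEmpty()) throw new UncheckedParseException("empty input");
    }

    // Checked: the signature must declare the exception.
    static void parseChecked(String input) throws CheckedParseException {
        if (input.isEmpty()) throw new CheckedParseException("empty input");
    }

    // The caller may simply let an unchecked exception bubble up.
    static String callUnchecked(String input) {
        parseUnchecked(input);
        return "ok";
    }

    // The caller is forced by the compiler to catch or redeclare a checked exception.
    static String callChecked(String input) {
        try {
            parseChecked(input);
            return "ok";
        } catch (CheckedParseException e) {
            return "error: " + e.getMessage();
        }
    }
}
```

Note that callUnchecked could still wrap its call in a try-catch if it wanted to; the unchecked variant just does not oblige it to.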

So, you see, it's up to you. The goal here is to set defaults that make sense under most conditions. It normalizes the end-of-line to a lone \n if you don't specify otherwise, but you can change that if you wish. As for tab characters, the default (leaving the TAB characters in the input and treating each one like a single space for error reporting) is actually not a behavior that most people would want. However, since there is no completely unambiguous way to treat the TAB characters, it is left to the user to decide. If you say:

 TAB_SIZE=8;

the tabs are converted to spaces on the basis of tab stops at 8 character intervals. If you want to conserve the tabs in the input, you could have:

  TAB_SIZE=4;
  PRESERVE_TABS=true;

And that leaves the tab characters in the input and reports column locations on the basis of the idea that the tab stops are at 4-space intervals.
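Here is a rough sketch of what reporting a column on the basis of tab stops could look like when the tab characters are still in the buffer. It is only meant to show the idea, not how JavaCC 21 does it internally; ColumnReporter and columnOf are hypothetical names, and columns are assumed to be 1-based.

```java
// Hypothetical helper for illustration only; not the actual JavaCC 21 implementation.
final class ColumnReporter {
    // Returns the 1-based column of the character at `offset` in `line`,
    // where each TAB before it advances to the next tab stop and every
    // other character advances by one column.
    static int columnOf(String line, int offset, int tabSize) {
        int col = 1;
        for (int i = 0; i < offset; i++) {
            if (line.charAt(i) == '\t') {
                col += tabSize - ((col - 1) % tabSize);
            } else {
                col++;
            }
        }
        return col;
    }
}
```

For a line consisting of one TAB followed by "x = 1", the x is reported at column 5 with a tab size of 4 and at column 9 with a tab size of 8. Passing a tab size of 1 gives column 2, which corresponds to the default of treating a TAB like any other single character.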