It’s Official! The Gigabyte is the new Megabyte! Hurrah!

Some years back, in late 2012, I was traveling with a couple of Spanish friends. We were in Kunming, China and came across a poster that advertised some live music. I was fumbling around for a pen and a bit of paper to jot down the information. I was about to ask my traveling companions whether either of them had a pen, when one of them whipped out his smartphone and simply photographed the poster. For some reason, this was a revelation to me. Of course, I also had a smartphone, but somehow, it did not occur to me to do that. In my mental world of the time, to note down the information on the poster, I needed to write it down somewhere.

Well, I try to live and learn, so I think that was the last time that I fumbled around for a pen and paper in a similar situation. I've thought about this on several occasions since then, and one thing that occurred to me was that, for people of a much younger generation, the whole thing would be completely reversed. Not only would they photograph the poster, but it would not even occur to them to do anything else!

But finally, all of this goes to show that old habits die hard. This particularly applies to old patterns of thought...

I recently came across this page that traces the price of a megabyte of memory from 1957 onwards. I found this really quite thought-provoking generally, but particularly in the context of JavaCC development. The early predecessor of JavaCC (and not just of JavaCC but of any other parser generator) is the venerable YACC, developed at Bell Labs in the 1970s. In 1975, the year that YACC was first released, judging by that page I link above, memory prices were in a nose-dive. The first reported price there from 1975 is $421,000, but by the end of the year, that was down to just under $50,000. Specifically, you could buy an Altair 4K static RAM board for a mere 195 dollars. So, if you bought 256 of them to cobble together somehow to have 1MB of RAM, that would set you back $49,920. (Admittedly, I suppose that, unless you're really a piss-poor shopper, if you were to buy 256 of those boards, you could negotiate a bulk discount, so it would cost significantly less, but still some serious cash... something like the price of a house at the time!)

In 1996, the year that JavaCC came out, the price of RAM was also in a vertiginous drop. A megabyte of RAM started the year at a bit under $30 and ended the year a bit over $5 a megabyte. In short, between YACC and JavaCC, the price of RAM had already gone down by about 4 decimal orders of magnitude. 10,000 to 1.

In spring of 2008, when I took my first run at hacking the JavaCC source code, the price of a megabyte of RAM was about 2 cents. It had come down about 2 decimal orders of magnitude in the dozen years since the initial JavaCC release. Or 6 decimal orders of magnitude since the YACC era. A million to one price drop.

Now, at the time of this writing, the price of RAM seems comparatively stable. The last price given on that page is from March 23 of this year (2020) and is practically unchanged from the price a year before, approximately $0.0035 per megabyte, about a third of a cent. So, from JavaCC to JavaCC21, i.e. from 1996 to 2020, we have, approximately, a thousand-fold drop in price. These figures understate the case, by the way, because I don't think the page takes inflation into account! So the real price drop is probably more like 4 decimal orders of magnitude than 3. Thus, if the point of comparison is between now and 1996, to say that the gigabyte is the new megabyte is actually, if anything, an understatement!
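For anyone who wants to check the arithmetic, here is a minimal sketch in Java. The prices are just the ones quoted in this essay (the 1975 figure being 256 Altair boards at $195 each); the class and method names are made up for the illustration, and no inflation adjustment is attempted.

```java
public class PriceDrop {

    /** Decimal orders of magnitude between an earlier and a later price. */
    public static double ordersOfMagnitude(double earlier, double later) {
        return Math.log10(earlier / later);
    }

    public static void main(String[] args) {
        // Prices per megabyte of RAM, in dollars, as quoted in the text above.
        double price1975 = 256 * 195;  // 256 boards of 4K at $195 each
        double price1996 = 5.0;        // end of 1996
        double price2008 = 0.02;       // spring of 2008
        double price2020 = 0.0035;     // March 2020

        System.out.printf("1975 -> 1996: %.1f orders of magnitude%n",
                ordersOfMagnitude(price1975, price1996));
        System.out.printf("1996 -> 2008: %.1f orders of magnitude%n",
                ordersOfMagnitude(price1996, price2008));
        System.out.printf("1975 -> 2008: %.1f orders of magnitude%n",
                ordersOfMagnitude(price1975, price2008));
        System.out.printf("1996 -> 2020: %.1f orders of magnitude%n",
                ordersOfMagnitude(price1996, price2020));
    }
}
```

Rounded to the nearest whole number, these come out at 4, 2, 6 and 3 orders of magnitude respectively, which matches the rough figures given above.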

One funny aspect of all this is that, to figure out the price of a megabyte of RAM in the mid-seventies, you have to look at the price of a 2k or 4k board, say, and extrapolate upwards. However, now, to figure out the price of one meg, you have to look at the price of a 2GB or 4GB stick and extrapolate downwards! (Way way down, to a fraction of a penny!) I'm writing this little essay on my regular desktop work machine, which is nothing very special, a 2014 vintage iMac. It has 32 GB of RAM. (Why that amount specifically? Simple reason. Because that is the most you can put in there!) I mean to say, memory is so dirt cheap that it would make little sense not to spend the approximately $100 to expand to the maximum. (If I could put 64 GB in that model, it would probably have that!)

In very recent private correspondence, Sriram Sankar, the original developer of JavaCC, told me that there are many things he would have done differently if he had to do it over again. Now, Sriram didn't actually need to tell me that. I mean, given that, since his original work, the gigabyte is the new megabyte, it would be extremely surprising if he had said that he would do everything the same now! The thousand-fold drop in the price of RAM is just too big a game-changer, and that is without even mentioning the performance of CPUs, or the fact that Java technology became more efficient even on top of all of that! Well, unfortunately, Dr. Sankar has not been involved in JavaCC development for a very long time, and if the people who took control of the project ever felt the need to re-examine some of the underlying assumptions, there is certainly no sign of it.

Here is just one striking example. The legacy JavaCC tool generates a SimpleCharStream.java file that starts with a 4k buffer in which to read in characters. If more memory is needed, it grudgingly expands the buffer in 2k increments. A correspondent a few months ago mentioned this to me and told me that this was a show-stopper for him because his application had Tokens that were up to a megabyte in length. His JavaCC generated scanner ground to a halt because, to get up to a 1 megabyte buffer, it had to be expanded and recopied about 500 times.

My own fix to this problem was two-fold actually. In my version of this same code, I start the buffer at 8k instead of 4k, but that's trivial. More importantly, when the buffer needs to be expanded, I double its size, so it goes from 8k to 16k to 32k to 64k et cetera. So, to get up to 1 MB, it needs to expand the buffer just seven times. Moreover, it would only be the last few calls to ExpandBuff() that are particularly expensive, I suppose, going from 128K to 256K to 512K to 1024K. But just imagine the older disposition, the last 100 calls or so to ExpandBuff(). It would create a new buffer of 802k and then recopy the 800k buffer over there, and then create an 804k buffer and recopy and then an 806k buffer...
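To put rough numbers on the difference, here is a small simulation. To be clear, this is not the code from SimpleCharStream.java, just a sketch that assumes every expansion allocates a new buffer and copies the full old buffer into it. The class and method names are invented for the illustration; the sizes (4k start with 2k increments, versus 8k start with doubling) are the ones described above.

```java
public class BufferGrowthDemo {

    /** Outcome of a simulated run: how many expansions, how many chars recopied. */
    public static final class Result {
        public final int expansions;
        public final long charsCopied;

        Result(int expansions, long charsCopied) {
            this.expansions = expansions;
            this.charsCopied = charsCopied;
        }
    }

    /** Grow by a fixed increment each time (the linear scheme). */
    public static Result growLinearly(int initialSize, int increment, int targetSize) {
        int size = initialSize;
        int expansions = 0;
        long copied = 0;
        while (size < targetSize) {
            copied += size;      // assumed: the whole old buffer is copied over
            size += increment;
            expansions++;
        }
        return new Result(expansions, copied);
    }

    /** Double the size each time (the geometric scheme). */
    public static Result growByDoubling(int initialSize, int targetSize) {
        int size = initialSize;
        int expansions = 0;
        long copied = 0;
        while (size < targetSize) {
            copied += size;      // assumed: the whole old buffer is copied over
            size *= 2;
            expansions++;
        }
        return new Result(expansions, copied);
    }

    public static void main(String[] args) {
        int oneMeg = 1024 * 1024;
        Result linear = growLinearly(4 * 1024, 2 * 1024, oneMeg);
        Result doubling = growByDoubling(8 * 1024, oneMeg);
        System.out.println("linear:   " + linear.expansions + " expansions, "
                + linear.charsCopied + " chars copied");
        System.out.println("doubling: " + doubling.expansions + " expansions, "
                + doubling.charsCopied + " chars copied");
    }
}
```

Under those assumptions, the linear scheme needs 510 expansions and recopies about 268 million chars on the way to a 1 MB buffer, while doubling needs 7 expansions and recopies about 1 million. The total copying grows with the square of the final size in the first case and only in proportion to it in the second.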

Well, clearly this is a bug. It must be elementary computer science that in such a case the buffer size needs to expand geometrically, not linearly. (I don't know for sure because I am utterly self-taught in computer programming, but I suspect this.) Regardless, this buggy code is all over the place. Just do the following Google search: "SimpleCharStream.java site:github.com" and you get over 800 hits. (And that's just on GitHub and just on projects that check generated code into their version control repository, which IMHO they should not be doing anyway!) And some of these SimpleCharStream.java files were checked in just a few months ago. Here is one example:  checked in on 13 December, 2019.

Now, granted, most people don't have huge Tokens so they don't run into this, but it is a bug. And okay, bugs happen, so big deal. (Well, this particular bug is quite a bit older than my teenage daughter.) But, aside from that, I think it is important to understand that the origin of the bug lies not so much in any coding error, but in some (by now) quite anachronistic thinking about how expensive memory is! The legacy JavaCC code is replete with things like this that reflect this mind-set that memory is something extremely precious and scarce and thus, there is a willingness to jump through all sorts of hoops (write very convoluted code) to save 1k or 2k or even just a few bytes of memory usage here and there.

Now, above I said that my solution to this problem was two-fold and I only mentioned one solution, which was doubling the buffer when needed rather than increasing its size linearly. The other solution (really, the more important one) is that I simply re-implemented this part of the code and moved towards a disposition where, by default, we just read the entire file into memory. You can still have the older scheme of only reading a 4k (now 8k) buffer (that expands geometrically if necessary) into memory at any given point. You can set this by using the setting I introduced, HUGE_FILE_SUPPORT, but if that is not set (and by default, it is not) all of the parser/lexer machinery just assumes that the entire file is available in memory.

You see, the "21" in "JavaCC 21" is not a version number. It is meant to stand for "21st century". So JavaCC 21 development dispenses with these 20th century shibboleths (which is what they are basically) that memory is so expensive and there is no code so convoluted and opaque that we would not write (and subsequently maintain) to economize on a bit of memory. (Never mind that a megabyte of RAM now costs a fraction of a penny!)

Now, I anticipate that the HUGE_FILE_SUPPORT option will be there for the foreseeable future, probably indefinitely, since there always will be legitimate cases where you don't want to read the entire file into memory. However, the assumption is that these cases are rare. (And getting rarer.) A tool like JavaCC is primarily for munging text. And the fact is that text files are usually not that big. The really humongous files on one's hard disk are invariably binary files, like video and such. For example, the full text of Tolstoy's War and Peace (surely one's archetype of one massive honking big book) is just a few megabytes. (A bit over 3 megs as I recall.) Meanwhile, a single photo (depending on your camera settings) sitting on your phone is easily bigger than that and most young people will snap hundreds of such selfies without even thinking about it.

Regardless of all that, the important thing to understand here is that the newer features being added to JavaCC 21 will pretty much invariably assume that the full text of the parsed file is in memory. You see, the bottom line is that there are all sorts of interesting things one can do, features one can add, but they almost all involve having access to the full text being parsed. In particular, for error handling, fault-tolerant parsing, incremental parsing... we just have to assume that we can scan forward or rewind backward arbitrarily in the file that we're working on. Thus, for example, the new ATTEMPT/RECOVER feature that I implemented recently is only available if HUGE_FILE_SUPPORT is unset. And this will be the case with just about all new features in the pipeline. If you were to insist on setting HUGE_FILE_SUPPORT to true, you would still basically have a superset of the legacy JavaCC functionality (well, aside from some things I removed that are pretty much useless anyway) but in that case you wouldn't get the benefit of newer features being added. (And quite possibly, you don't need them. To each their own... It just behooves us to have all of this clear.)

In this vein, I would add that the baseline assumption of JavaCC 21 is that the normal usage of the tool is to build a tree. Again, you can turn that off, via TREE_BUILDING_ENABLED=false, but the default, our baseline assumption, is that a parser builds a tree. (And then one typically traverses the tree in some sort of visitor pattern. We believe this to be the typical usage.) Now, the funny thing about this is that if you're going to build an AST of the whole file, you're already effectively storing the file's contents in memory, so this whole disposition with the itty-bitty (by today's standard) 4k buffer never made any sense at all in conjunction with using JJTree! And since the default in JavaCC 21 is effectively to use JJTree (even though it no longer exists as a separate tool, but I mean that functionality...) it actually makes perfect sense for the default to be just to read the whole file into memory. So, the constructors for the generated Parser and Lexer classes that took a java.io.Reader as an argument are still there, but there are also constructors that simply take a String (a CharSequence actually, but that's a detail...) and we anticipate that in typical usage moving forward, those will be the ones more used. You simply read in the file yourself into a StringBuilder or some such stringy object (that implements CharSequence) and you pass that to the constructor. Or, in other words, the normal usage is that what you pass into a Parser's constructor is just text. The constructors that take a Reader as an argument may ultimately be deprecated, at least assuming that you don't have HUGE_FILE_SUPPORT set.
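Reading the input yourself is only a few lines. Here is a minimal sketch of slurping a Reader into a StringBuilder, as described above. The helper name readFully is invented for the example, and the parser class in the final comment is hypothetical, standing in for whatever class your grammar generates.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReadWholeInput {

    /** Reads everything the Reader has to offer into one in-memory CharSequence. */
    public static CharSequence readFully(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] chunk = new char[8192];
        try {
            int n;
            // read() returns -1 once the end of the stream is reached
            while ((n = reader.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a file here so the example is self-contained.
        try (Reader reader = new StringReader("int x = 42;")) {
            CharSequence text = readFully(reader);
            System.out.println("Read " + text.length() + " chars");
            // Hypothetical usage, assuming a generated parser class named MyParser:
            // MyParser parser = new MyParser(text);
        }
    }
}
```

For an actual file on disk, something like java.nio.file.Files.readString (Java 11 and later) does the same job in one call.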

So, in closing, I would say that this is one rather odd aspect of legacy JavaCC. This is a project with a history of over 20 years, a period in which, quite literally, the gigabyte has become the new megabyte, and none of these sorts of assumptions have ever been re-examined. Actually, the situation is more pathological than that. My correspondent who told me about that bug with the buffer being expanded linearly told me that he had offered a bug fix to the project maintainers, but they insisted that it was necessary to introduce a setting so that people could still have the older behavior! My correspondent told me that he simply threw up his hands in disgust at this point. It's fascinating, of course, because the older behavior was just buggy. There is no conceivable reason to want to preserve it via an optional setting or otherwise! The psychology of all of this is fascinating but seeing as I have no training in psychology, I shall refrain from going on about it. I have no formal training or qualifications in computer science either, but I suppose it is best to stick to one such field at a time!
