April 02, 2006

Parser generators : ANTLR, SableCC, JavaCC, bison at a glance.

One looks towards parser generators (or compiler compilers) when there is a need to translate something in one language to other language. Basically you need to read a sequence of characters, make a data structure which represents the same. For many interesting problems, you want to build a tree. Using this, one can generate a new sequence of character, which is translation of input or just do things on that data structure.

A grammar (context free grammar, technically speaking) says what type of children a node can have. The leaves of tree cannot have any nodes under them and they directly correspond to the characters in the input. Looking at the input, we try to construct a tree which complies with the rules declared by grammar. If only one such tree can be made for any valid input, then the grammar is called unambiguous.

To construct the tree, we have many choices for selecting nodes, which can lead to the leaves to correspond to the input sequence seen so far. If we try to construct it from leaves, it is called bottom up approach. This approach usually uses a method of construction called LR parsing. If we attempt to build the tree top-down, the method usually used is called LL parsing.

Now let us discuss about the free tools that we have at our hand. Given a grammar and instructions on what to do with various nodes, these tools generate a parser.

ANTLR is derived from good old C/C++ PCCTS tool. Antlr has lots of features and can generate the parser in Java or C/C++. It uses top down approach to parsing.

JavaCC is a tool used in many applications, which is much like antlr, with few features different here and there. However, it just generates Java code.

The top down parser generators mentioned also support predicates, using which you can try to 'peep' in upcoming input. This is against the rules of game of context free parsing, but convenient in practice.

SableCC is a bottom up parser, which takes an unconventional and interesting approach of using object oriented methodology for constructing parsers. This results in easy to maintain code for generated parser. However, there are some performance issues at this point of time. It generates output in both C++ and Java

GNU Bison is classical bottom up parser. It generates C language output. It has been the standard tool on many unix operating system distributions. Now, it has also acquired the capability of parsing grammars with GLR method. It's implementation of it is not the best around, but we may have a look at lesser used parsing methods some other time.

There is a plethora of such tools. refer to Wikipedia - compiler compilers for details.

No comments:

Post a Comment