Issue #6 of the Xtext Corner is about some particularities of the parsing process in
Xtext. As I already mentioned a few times in the past, Xtext uses
Antlr under the hood to do the actual parsing. This is basically a two-step process: first, the input sequence of characters is split into tokens (often referred to as terminals) by a component called the lexer. The second step is to process the resulting list of tokens. The actual parser is responsible for that stage: it creates the abstract syntax tree from the token stream.
This divide-and-conquer approach is usually just called parsing altogether, so the distinction between lexing and parsing is mostly hidden. Nevertheless, the Xtext grammar definition honors both aspects: it is possible to define (parser) rules that are processed by the parser, and it is also possible to define terminal rules, which are handled by the lexer. So when should I use parser rules and when should I use terminal rules?
Production Rules (also: Parser Rules)
The obvious case, and just for the sake of completeness: Production rules will yield an instance in the abstract syntax tree. These can only be implemented by the parser, so there is no question of whether to use terminals instead. Production rules are the most common rules in almost every Xtext grammar.
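To illustrate, here is a minimal production rule along the lines of the well-known hello-world example (the rule is purely illustrative and not taken from any particular grammar):
Greeting: 'Hello' name=ID '!';
Whenever the parser matches this rule, it creates an instance of the inferred Greeting EClass and sets its name feature.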
Data Type Rules
Those are a completely different thing even though they are handled by the parser, too: Where ordinary parser rules produce instances of EClasses, data type rules will return data types (you did not guess that, did you?). Data types in the sense of Xtext and its usage of the Eclipse Modeling Framework are basically primitive Java types, Strings or other common types such as BigDecimal or enums. The parser will not create those on its own but rather pass the consumed tokens as a string to a value converter. The language developer is responsible for converting that string to a data type.
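The classic example of a data type rule is a qualified name (a sketch, not tied to any specific grammar):
QualifiedName: ID ('.' ID)*;
Because the rule contains no assignments, Xtext treats it as a data type rule that returns a plain String (ecore::EString) by default; a registered value converter may post-process the consumed text if necessary.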
Terminal Rules
Terminal rules are essentially the same as data type rules if you only consider the interface to the grammar. Internally they are completely different since they are not processed by the parser but by the lexer. The consequences are quite severe if you want to get a working grammar. But one thing at a time: As already mentioned, terminal rules can return the very same things as data type rules can. That is, they yield Strings, ints or other primitives. But since they are handled by the lexer, they are not quite as powerful as data type rules are.
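A terminal rule may declare its return type explicitly, as in this sketch (which mirrors the INT rule from the default terminals grammar shipped with Xtext):
terminal INT returns ecore::EInt: ('0'..'9')+;
The lexer only produces the token; the associated value converter is what actually turns the token text into an int.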
Implementation Aspects
The lexer is a pretty dumb component which is generated in a way that weighs performance over intuitive behavior. Where the parser generator will produce nice error messages in case of ambiguous data type rules, conflicting terminals are mostly resolved by a first-come-first-served (FCFS) principle. For terminal rules it's crucial to declare them in the right order. Consider the following terminal rules:
terminal ID: ('a'..'z') ('a'..'z'|'0'..'9')*;
terminal CHARS: ('a'..'z'|'_')+;
The ID rule shall consume something that starts with a lowercase letter and is followed by any number of lowercase letters or digits. The CHARS rule is pretty close to that one: It shall match a sequence that contains only lowercase letters or underscores. The problem with these is that the matched sequences are not mutually exclusive. If you take the input abc as an example, it will be matched as an ID given the two rules above. As soon as you switch the order of the declarations, the sequence abc will all of a sudden be returned as CHARS. That's one thing that you have to be aware of if you use terminal rules. But there is more to keep in mind.
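Just to make the ordering effect explicit, here are the same two rules with the declarations swapped:
terminal CHARS: ('a'..'z'|'_')+;
terminal ID: ('a'..'z') ('a'..'z'|'0'..'9')*;
With this order, the input abc is handed to the parser as a CHARS token, even though the rule bodies have not changed at all.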
Terminal rules are applied without any contextual information, which has some interesting implications. The plus: it's easily possible to use the lexer on partial input - entry points can be computed almost trivially, and there is no such thing as an entry rule as there is for the parser. But the disadvantages have to be taken into account, too. The lexer does not care about the characters that are still to come in the input sequence. Everything that matches will be consumed - and not reverted (as of Antlr 3.2). To explain what that means, let's take another example:
terminal DECIMAL: INT '.' INT;
terminal INT: '0'..'9'+;
The rules define decimal numbers in a hypothetical language that also supports method invocation. At first glance, things seem to work fine: 123 is consumed as an INT whereas 123.456 will be a DECIMAL. But the devil's in the details. Let's try to parse the string 123.toString(). The lexer will find an INT 123 - so far, so good. Now it sees a dot, which is expected by the terminal rule DECIMAL. The lexer will consume the dot and try to read an INT afterwards - which is not present. Now it'll simply fail for the DECIMAL rule but never revert that dot character which was consumed almost by accident. The lexer will create an invalid token sequence for the parser and the method call cannot be read successfully. That's because the lexer simply does not know about things like the expectation in the current parser state. Attempts to define decimals, qualified names or other more complex strings like dates as terminal rules are very error prone, whereas they can often be implemented quite easily by means of data type rules:
DECIMAL: INT '.' INT;
terminal INT: '0'..'9'+;
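For completeness, the data type rule can also declare the value type it yields. The following variant is only a sketch and assumes that a matching value converter for DECIMAL is registered:
DECIMAL returns ecore::EBigDecimal: INT '.' INT;
The parser still consumes the tokens INT, '.', INT, but the value that ends up in the model is a BigDecimal produced by the converter.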
Terminals Considered Harmful?
Data type rules move the decision from the dumb, highly optimized lexer to the parser, which has a lot more information at hand (the so-called lookahead) to make decisions. So why not use data type rules everywhere? The simple answer is: performance. The duo of lexer and parser is optimized for a stream of reasonably sized tokens instead of hundreds of single characters. Things will not work out that well with Antlr at run-time if the parser is implemented in a so-called scanner-less manner. The rule of thumb here is to use only a small number of terminal rules that can be distinguished easily and to put data type rules on top of those. It'll simplify your life as a language developer tremendously.
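A small sketch of that rule of thumb (illustrative rules only, not taken from a specific grammar): keep the terminals coarse and mutually exclusive, and express the interesting structure with data type rules on top.
terminal ID: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
terminal INT: ('0'..'9')+;
QualifiedName: ID ('.' ID)*;
DECIMAL: INT '.' INT;
Here the lexer only has to tell identifiers and integers apart, while qualified names and decimals are assembled by the parser, which can take the surrounding context into account.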