Wednesday, October 31, 2012

Xtext Corner #5 - Backtracking vs Syntactic Predicates

The Xtext grammar language allows to create a working parser in almost no-time. Its concise notation to describe the concrete syntax and the mapping to an object model is giving quite a jump start if you want to create a language. Nevertheless it's also quite easy to get into some trouble. Xtext uses Antlr 3.2 as the underlying parser technology and we try really hard to hide the complexity and peculiarity of Antlr. Unfortunately that's not possible in all cases. From time to time Antlr will report ambiguities in the grammar definition with a charming message like this:
warning(200): Decision can match input such as "{EOF, RULE_ID, '('}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
The parse generator basically complains about an ambiguous grammar. At some point in the syntax description it cannot decide which path to follow for a given input sequence. It's rather obvious that the warning message is not really helpful. Neither is there any chance to find the correct line that caused the error (which is not a problem of Antlr but cause by the translation of Xtext to Antlr) nor is it easily possible to spot the concrete decision that the parser generator complains about. The worst about this is that it's not really a warning either. What the parser generator basically did is the following: It removed a possible path from the grammar description. It will always choose the one remaining path for that particular situation. Which could by chance be the one that you'd expect. But it could also be the wrong path.


Fortunately there is a tool that helps to identify the problem: AntlrWorks allows to take a look at the grammar and visualize all the problems that it has graphically. It's still far from trivial to find the root cause of a problem but better than nothing. Make sure you pick the version 3.2 from download section if you want to give it a try.

Now you may wonder how you should handle the cases that are ambiguous by definition and by intention. You could of course enable backtracking in your language and afterwards everything appears to be fine. However, you can think about backtracking as a wildcard for Antlr to remove alternatives from your grammar everywhere where it spots an ambiguity. This will shadow the real problems in the grammar that may be introduced due to subsequent changes, a refactoring or new language features. That's why I strongly recommend to go for the hard way and analyze the root cause for the warnings. As soon as you found the actual decision that the parser generator complained about, you can use a syntactic predicate to fix that locally. Now you are in control on which alternative to remove and which path to follow. I think that makes perfect sense to be in charge in those cases.


But the shadowing of errors in the generator is only one drawback of backtracking. It will furthermore lead to surprising messages at run-time. If you consider the following snippet it's easy to see that the right operand of the binary operation is missing.
The parser will correctly report something along the lines of
mismatched input '}' expecting RULE_INT
Unfortunately it does so on a totally unexpected location. If you enabled backtracking and the algorithm decides that the function declaration is not complete - it fails to read a valid function body - the parsing will roll back to the start of the function and put the most specific error message on that token. You'll see an error marker under the keyword function. However, it would be more intuitive to have that error on the binary operation. At least that's what I would expect, wouldn't you?

Syntactic Predicates

Nevertheless it's not always possible to write an unambiguous grammar. There are some common patterns that are undecidable by definition. The most famous one is the Dangling else. If a language allows nested if-else constructs, it's not definite where a subsequent else keyword may belong to. Consider the following Java snippets which only differ in formatting:

The semantics of both code snippets should be independent from the formatting. Nevertheless it's ambiguous for the parser in the same way as a reader might be confused by an inconsistent indentation. Therefore you have to force the parser into one concrete direction in order to disambiguate the grammar. A syntactic predicate has to be added.

The => operator forces the parser to go a certain path if the input sequence would allow two or more possible decisions. It can be read as If you see these tokens, go this way. It's even possible to use alternatives or groups of elements as the criteria. Only the UnorderedGroup is prohibited in predicates.

In this example, the parser shall follow the given path if it can look ahead to a sequence like (or more abstract ID . ID =).

Implementation Detail

One thing is important to note. Syntactic predicates in Xtext are different from the plain Antlr predicates. In the Xtext grammar language it's only possible to use a complete or a partial sequence of production tokens as the predicate where Antlr allows to use arbitrary tokens that seem to be independent from the actual rule content. Here Antlrs approach appears to be more powerful. But actually that's only at a first glance. Firstly Xtext's variant is easier to use since you don't have to repeat parts of your grammar manually. And secondly it's the framework that does the heavy lifting: The syntactic predicates in Xtext are automatically propagated to the right places which you'd have to do manually otherwise. Just insert it at the spot that you identified with AntlrWorks and you're done.


Damedos said...

Once more an excellent post! Thanks Sebastian!

LevanenG said...

Thank you! The operator '=>' saves my life

aliyaa said...

The best ever sentence structure checker are provided here. Good grammar and good sentence structure, plus a compelling topic make up the best article.