Tuesday, November 6, 2012

Xtext Corner #7 - Parser Error Recovery

A crucial feature of the parser in Xtext is the ability to recover from errors: The parser may not fail on the first erroneous token in an invalid input but should continue after that token. In fact it should continue to the end of the document and yield an AST that is incomplete but contains as many nodes as possible. This feature of the parser is called error recovery: The ability to consume input text that is not conform to the grammar.

Error recovery is obviously necessary in an interactive environment like an editor since most of the time the input will actually be invalid. As soon as the user starts to type, the document may be broken in all sorts of ways. The user does not really care whether his actions are in line with some internal AST-structure or grammar rules. Copy and paste, duplicate lines, remove portions of the file by using toggle comment or just plain insertion of characters into the editor - none of these operations should cause the parser to fail utterly. After all, content assist, the outline and code navigation are expected to work for broken documents, too - at least to some extend.

Recovery strategies 

The Antlr parser that is generated from an Xtext grammar supports different recovery strategies. If the parser encounters an invalid input, it'll basically perform one of the following operations:
  1. Skip the invalid terminal symbol
    If the current token is unexpected, the following terminal is considered. If that one matches the current expectation, the invalid token is skipped and flagged with an error.
  2. Inject the missing token
    The seen terminal is not valid at the current position in the parsing process but would be expected as the subsequent token. In that case, the parser might inject the missing token. It skips the current expectation and continues with the next step in the parsing. The information about the missing token is annotated on the current input symbol.
  3. Pop the parsing stack
    If the input is broken in a way that does not allow the parser to skip or inject a single token, it'll start to consume the following terminal symbols until it sees a token that is somehow unique in the expectation according to the current parsing stack. The parser will pop the stack and do a re-spawn on that element. This may happen in the rare case that the input is almost completely messed up.
  4. Fail
    The mentioned recovery strategies may fail due to the structure of the grammar or the concrete error situation in the input. In that case parsing will be aborted. No AST for subsequent input in the document will be produced.

Helping the Parser

There exist several things that you should watch out for if you experience poor error recovery in your language. First and foremost it may be the absence of keywords in the grammar. Keywords are often the only anchor that the parser can use to identify proper recovery points. If you feel tempted to write an overly smart grammar without any keywords because it should look and feel like natural language, you should really reconsider your approach. Even though I don't want to encourage a keyword-hell, keywords are somehow convenient if they are used properly. And please note that things like curly braces, parentheses or other symbols with only one character are as good as a keywords as other, longer sequences - at least from the parsers perspective. So to give a very popular example: Instead of using indentation to describe the structure of your language (similar to Python), using a c-style notations may save you a lot of effort with the grammar itself and provide a better user experience when editing code. And keywords also serve as a nice visual anchor in an editor so users will have an easier time when reading code in your language.

A second strategy to improve the behavior of the parser and the chance for nice error recovery is the auto-edit feature. It may have some flaws but it's quite essential for a good user experience. The most important aspect here is the insertion of closing quotes for strings and comments. As soon as you have an input sequence that is not only broken for the parser but lets even the lexer choke, you are basically screwed. Therefore multiline comments and strings are automatically closed as soon as you open them. If you use custom terminal rules, you should really consider to look for unmatched characters that should be inserted in pairs according to the lexer definition. The rule basically applies for paired parentheses, too. Even though the concrete auto-edit features may still need some fine tuning to not get in the way of the user, they already greatly improve the error recovery of the parser.

1 comment:

David Pizarro said...

Do you have any tips for forcing particular error recovery for a specific rule?

I have a poorly designed language where if I can't identify a segment properly, I want to pop the parsing stack until I reach a specific token. This is so I can continue parsing the rest of the file from there.

e.g.
Model hidden(SYM_SPACE, SYM_TAB):
elem+=Option? =>(SYM_EOL elem+=Option?)*
;

fails to parse further "Option" elements after an unexpected token is present after the SYM_EOL.