- Keywords may help to recover from parse errors in the sense that they guide the parser.
- It's recommended to use libraries instead of a hard-wired, keyword-ish representation for some built-in language features.
- Data type rules are the way to go if you want to represent complex syntactical concepts as atomic values in the AST.
Spaces are evil!
terminal ID:
('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9')*;
IDs and Keywords
IDs start with a letter followed by an arbitrary number of additional letters or digits, and keywords usually look quite similar to an ID. No surprises so far. Now let's assume a keyword definition like 'some' 'input' compared to 'some input'. Here is what happens if the lexer encounters the input sequence 'som '. It starts by consuming the leading 's' without yet deciding which token to emit, since the input could still become a keyword or an identifier. The same holds for the 'o' and the 'm'. The trailing space is the character where it can finally decide that 'som ' contains two tokens: an identifier and a whitespace token. So far so good.
Let the Parser Fail - For Free
Now comes the tricky part: the user continues to type an 'e' after the 'm', yielding 'some '. Again, the lexer starts with the 's' and continues to consume the 'o', 'm' and 'e'. No decision has been made yet: it could still be an ID or the start of the keyword 'some input'. The next character is a space, and that's the crucial part: if the grammar contains a keyword 'some input', the space is expected since it is part of the keyword. Now the lexer has only one valid alternative left. After the space it is keen on consuming an 'i', 'n', 'p', 'u' and 't'. Unfortunately, there is no 'i' in the parsed text since the lexer has already reached the end of the file.
As already mentioned in an earlier post, the lexer will never roll back to the token 'some' in order to create an ID token and a subsequent whitespace token. In fact, the space character was expected as part of a single token, so it was safe to consume it. Instead of rolling back and creating two tokens, the lexer will emit an error token which cannot be handled by the parser. Even though the text appeared to be a perfectly valid ID followed by whitespace, the parser will fail. That's why spaces in keywords are considered harmful.
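The failure described above can be reproduced with a small toy model. The following Python sketch is not the generated lexer itself; it only imitates the one property that matters here: once the input has matched the keyword up to and including its space, the ID alternative is gone and the lexer never rolls back. The names `lex` and `KEYWORD` are made up for this illustration.

```python
import string

ID_START = set(string.ascii_letters)
ID_PART = ID_START | set(string.digits)
KEYWORD = "some input"  # hypothetical keyword that contains a space


def lex(text, keyword=KEYWORD):
    """Toy lexer: commits to the keyword after its space and never rolls back."""
    tokens = []
    pos = 0
    # Number of characters after which only the keyword alternative remains.
    commit = keyword.index(" ") + 1
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            tokens.append(("WS", ch))
            pos += 1
        elif ch not in ID_START:
            tokens.append(("ERROR", ch))
            pos += 1
        elif text.startswith(keyword[:commit], pos):
            # 'some ' was consumed: an ID cannot contain a space, so the
            # lexer is now committed to the keyword.
            if text.startswith(keyword, pos):
                tokens.append(("KEYWORD", keyword))
                pos += len(keyword)
            else:
                # Consume as much of the keyword as matches, then give up.
                n = commit
                while (pos + n < len(text) and n < len(keyword)
                       and text[pos + n] == keyword[n]):
                    n += 1
                # No rollback: everything consumed becomes an error token.
                tokens.append(("ERROR", text[pos:pos + n]))
                pos += n
        else:
            # Plain identifier: a letter followed by letters or digits.
            end = pos
            while end < len(text) and text[end] in ID_PART:
                end += 1
            tokens.append(("ID", text[pos:end]))
            pos = end
    return tokens


print(lex("som "))
print(lex("some "))
print(lex("some input"))
```

In this model, 'som ' yields an ID token and a whitespace token, while 'some ' yields a single error token even though it looks like a valid ID followed by a space.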
In contrast, the grammar variant with two separate keywords works fine. Here, the user is free to apply all sorts of formatting to the two adjacent keywords: any number of spaces, line breaks or even comments can appear between them, and all of it is valid and handled well by the parser. If you are concerned about convenience in the editor - after all, a single keyword with a space seems to be more user friendly in the content assistant - I recommend tweaking the content assistant instead of using an error-prone grammar definition.
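For comparison, here is a minimal Python sketch of the split-keyword variant. It assumes that whitespace and /* ... */ comments are hidden tokens that the parser skips; the names `lex`, `visible` and `KEYWORDS` are invented for this example and do not come from any generated code.

```python
import re

# Token kinds of the toy lexer; WORD is classified as KEYWORD or ID afterwards.
TOKEN_SPEC = [
    ("WS", r"\s+"),
    ("COMMENT", r"/\*.*?\*/"),
    ("WORD", r"[A-Za-z][A-Za-z0-9]*"),
]
KEYWORDS = {"some", "input"}  # two separate keywords, no embedded space
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)


def lex(text):
    """Tokenize text; every keyword ends where an ID would end, too."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            tokens.append(("ERROR", text[pos]))
            pos += 1
            continue
        kind, value = match.lastgroup, match.group()
        if kind == "WORD":
            kind = "KEYWORD" if value in KEYWORDS else "ID"
        tokens.append((kind, value))
        pos = match.end()
    return tokens


def visible(tokens):
    """Drop hidden tokens, leaving what the parser would actually see."""
    return [t for t in tokens if t[0] not in ("WS", "COMMENT")]


print(visible(lex("some   /* note */\n input")))
print(lex("some "))
```

Because no token spans a space, the lexer never has to commit early: 'some ' is simply the keyword 'some' followed by whitespace, and any formatting between the two keywords ends up in hidden tokens.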
2 comments:
Shouldn't that be "current parser technologies are evil"?
Spaces are a great way to visually align input.
The problem starts when (lazy?) developers cut corners and create lexers and parsers which aren't smart enough to handle common cases like dangling else or mixing grammars (e.g. when you want to embed SQL into a Java grammar).
Aaron,
visual alignment should not be tied to the semantics of the code. Instead, that's a matter for the pretty printer or formatter - and of course it's something that depends on the habits of the developer.
I doubt that there are better solutions for the rather complex problem of mixing languages than a divide and conquer approach, so I don't consider that cutting corners but a pragmatic way to solve most of the common problems.
Regards,
Sebastian