Sebastian Zarnekow's Blog: November 2012

Tuesday, November 27, 2012

Performance Is Not Obvious

Recently there was a post in the Xtext forum about the runtime performance of a particular function in the Xtext code base:

Ed Merks suggested to rewrite the method to a loop iteration instead of a recursive function and to save one invocation of the method eContainer such as the new implementation shall become at "least twice as fast."

I really liked the new form since is much easier to debug and to read and from that point a the change is definitely worth it. However, as I recently did a deep dive into the runtime behavior of the JVM, I doubted that the change would have to much impact on the actual performance of that method. So I took the time and sketched a small caliper benchmark in order to double check my intuition.

As it turns out, the refactored variant is approximately 5 to 20% faster for the rare cases where a lot of objects have to be skipped before the result was found and takes the same time for the common case where the requested instance is found immediately. So it's not even close to the expected improvements. But what's the take-away?

Before applying optimizations it's worthy to measure the impact. It may be intuitive to assume that cutting the number of method invocation down to a fraction of the original implementation - after all it was a recursive implementation before - saves a lot of time but actually the JVM is quite powerful and mature at inlining and producing optimized native code. So I can only repeat the conclusion of my talk about Java Performance (which I proposed for EclipseCon Boston, too):

[..] Write Readable and Clear Code. [..] (David Keenan)
[..] slavishly follow a principle of simple, clear coding that avoids clever optimizations [..] (Caliper FAQ)
Performance advice has a short shelf-life  (B. Goetz)

From that point of view, the refactored implementation is definitely worth it, even though there is no big impact on the runtime behavior of the method.

Wednesday, November 14, 2012

Xtext Corner #9 - About Keywords, Again

In the last weeks, I compiled some information about proper usage of keywords and generally about terminals in Xtext:

Keywords may help to recover from parse errors in a sense that they guide the parser.
It's recommended to use libraries instead of a hard wired keyword-ish representation for some built in language features.
Data type rules are the way to go if you want to represent complex syntactical concepts as atomic values in the AST.

In addition to these hints, there is one particular issue that arises quite often in the Xtext forum. People often wonder why their grammar does not work properly for some input files but perfectly well for others. What it boils down to in many of these examples is this:

Spaces are evil!

This seems to be a bold statement but let me explain why I think that keywords should never contain space characters. I'm assuming you use the default terminals but actually this fits for almost all terminal rules that I've seen so far. There is usually a concept of an ID which is defined similar to this:

terminal ID:
('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9')*;

IDs and Keywords

IDs start with a character followed by an arbitrary number of additional characters or digits. And keywords usually look quite similar to an ID. No surprises so far. Now let's assume a keyword definition like 'some' 'input' compared to 'some input'. What happens if the lexer encounters an input sequence 'som ' is the following. It'll start to consume the leading 's' and has not yet decided which token to emit, since it could become a keyword or an identifier. Same for the 'o' and the 'm'. The trailing space is the character where it can finally decide that 'som ' contains two tokens: an identifier and a whitespace token. So far so good.

Let the Parser Fail - For Free

Now comes the tricky part since the user continues to type an 'e' after the 'm': 'some '. Again, the lexer starts with the 's' and continues to consume the 'o', 'm' and 'e'. No decision was made yet: it could still be an ID or the start of the keyword 'some input'. The next character is a space, and that's the crucial part here: If grammar contains a keyword 'some input', the space is expected since it is part of the keyword. Now, the lexer has only one valid alternative. After the space it is keen on consuming an 'i', 'n', 'p', 'u' and 't'. Unfortunately, there is no 'i' in the parsed text since the lexer already reached the end of the file.

As already mentioned in an earlier post, the lexer will never roll back to the token 'some' in order to create an ID token and a subsequent whitespace. In fact the space character was expected as part of single token so it was safe to consume it. Instead of rolling back and creating two tokens, the lexer will emit an error token which cannot be handled by the parser. Even though the text appeared to be a perfectly valid ID followed by a whitespace, the parser will fail. That's why spaces in keywords are considered harmful.

In contrast, the variant with two split keywords of the grammar works fine. Here, the user is free to apply all sorts of formatting to the two adjacent keywords, any number of spaces, line breaks or even comments can appear between them, are valid and handled well by the parser. If you are concerned about the convenience in the editor - after all, a single keyword with a space seems to be more user friendly in the content assistant - I recommend to tweak that one instead of using an error prone grammar definition.

Thursday, November 8, 2012

Xtext Corner #8 - Libraries Are Key

In today's issue of the Xtext Corner, I want to discuss the library approach and compare it to some hard coded grammar bits and pieces. The question about which path to choose often arises if you want to implement an IDE for an existing language. Most languages use a run-time environment that exposes some implicit API.

Just to name a few examples: Java includes the JDK with all its classes and the virtual machine has a notion of primitive types (as a bonus). JavaScript code usually has access to a DOM including its properties and functions. The DOM is provided by the run-time environment that executes the script. SQL in turn has built-in functions like max, avg or sum. All these things or more or less an integral part of the existing language.

As soon as you start to work on an IDE for such a language, you may feel tempted to wire parts of the environment into the grammar. After all, keywords like int, boolean or double feel quite natural in a Java grammar - at least a first glance. In the long run it often turns out to be a bad idea to wire these things into the grammar definition. The alternative is to use a so called library approach: The information about the language run-time is encoded in an external model that is accessible to the language implementation.

An Example

To use again the Java example (and for the last time in this post): The ultimate goal is to treat types like java.lang.Object and java.util.List in the same way as int or boolean. Since we did this already for Java as part of the Xtext core framework, let's use a different, somehow artificial example in the following. Our dummy language supports function calls of which max, min and avg are implicitly available.

The hard-coded approach looks quite simple at first. A simplified view on the things will lead to the conclusion that the parser will automatically check that the invoked functions actually exist, content assist works out of the box and even the coloring of keywords suggests that the three enumerated functions are somehow special.

Not so obvious are the pain-points (which come for free): The documentation for these functions has to be hooked up manually, the complete signatures of them have to be hard-coded, too. The validation has to be aware of the parameters and return types in order to check the conformance with the actual arguments. Things become rather messy beyond the first quickly sketched grammar snippet. And last but not least there is no guarantee that the set of implicit functions is stable forever with each and every version of the run-time. If the language inventor introduces a new function sum in a subsequent release, everything has to be rebuild and deployed. And you can be sure that the to-be-introduced keyword sum will cause trouble at least in one of the existing files.

Libraries Instead of Keywords

The library approach seems to be more difficult at first but it pays off quickly. Instead of using hard-coded function names, the grammar uses only a cross reference to the actual function. The function itself is modeled in another resource that is also deployed with the language.

This external definition of the built-in functions can usually follow the same guidelines as custom functions do. But of course they may even use a simpler representation. Such a stub may only define the signature and some documentation comment but not the actual implementation body. It's actualy pretty similar to header files. As long as there is no existing format that can be used transparently, it's often the easiest way to define an Xtext language for a custom stub format. The API description should use the same EPackage as the full implementation of the language. This ensures that the built-ins and the custom functions follow the same rules and all the utilities like the type checker and documentation provider can be used independently from the concrete invoked function.

If there is an existing specification of the implicit features available, that one should be used instead. Creating a model from an existing, processable format is straight forward and it avoids mistakes because there is no redundant declaration of the very same information. In both cases there is a clear separation of concerns: The grammar remains what it should be: a description of the concrete syntax and not something that is tight to the run-time. The API specification is concise and easy to grasp, too. And in case an existing format can be used for that purpose, it's likely that the language users are already familiar with that format.

Wrap Up

You should always consider to use external descriptions or header stubs of the environment. A grammar that is tightly coupled to a particular version or API is quite error-prone and fragile. Any evolution of the run-time will lead to grammar changes which will in turn lead to broken existing models (that's a promise). Last but not least, the effort for a seamless integration of built-in and custom functions for the end-user exceeds the efforts for a clean separation of concerns by far.

A very sophisticated implementation of this approach, can be explored in the Geppetto repository at GitHub. Geppetto uses puppet files and ruby libraries as the target platform, parses them and puts them onto the scope of the project files. This example underlines another advantage of the library approach: It is possible to use a configurable environment. The APIs may be different from version to version and the concrete variant can be chosen by the user. This would never be possible with a hard-wired set of built-ins.

Tuesday, November 6, 2012

Xtext Corner #7 - Parser Error Recovery

A crucial feature of the parser in Xtext is the ability to recover from errors: The parser may not fail on the first erroneous token in an invalid input but should continue after that token. In fact it should continue to the end of the document and yield an AST that is incomplete but contains as many nodes as possible. This feature of the parser is called error recovery: The ability to consume input text that is not conform to the grammar.

Error recovery is obviously necessary in an interactive environment like an editor since most of the time the input will actually be invalid. As soon as the user starts to type, the document may be broken in all sorts of ways. The user does not really care whether his actions are in line with some internal AST-structure or grammar rules. Copy and paste, duplicate lines, remove portions of the file by using toggle comment or just plain insertion of characters into the editor - none of these operations should cause the parser to fail utterly. After all, content assist, the outline and code navigation are expected to work for broken documents, too - at least to some extend.

Recovery strategies

The Antlr parser that is generated from an Xtext grammar supports different recovery strategies. If the parser encounters an invalid input, it'll basically perform one of the following operations:

Skip the invalid terminal symbol
If the current token is unexpected, the following terminal is considered. If that one matches the current expectation, the invalid token is skipped and flagged with an error.
Inject the missing token
The seen terminal is not valid at the current position in the parsing process but would be expected as the subsequent token. In that case, the parser might inject the missing token. It skips the current expectation and continues with the next step in the parsing. The information about the missing token is annotated on the current input symbol.
Pop the parsing stack
If the input is broken in a way that does not allow the parser to skip or inject a single token, it'll start to consume the following terminal symbols until it sees a token that is somehow unique in the expectation according to the current parsing stack. The parser will pop the stack and do a re-spawn on that element. This may happen in the rare case that the input is almost completely messed up.
Fail
The mentioned recovery strategies may fail due to the structure of the grammar or the concrete error situation in the input. In that case parsing will be aborted. No AST for subsequent input in the document will be produced.

Helping the Parser

There exist several things that you should watch out for if you experience poor error recovery in your language. First and foremost it may be the absence of keywords in the grammar. Keywords are often the only anchor that the parser can use to identify proper recovery points. If you feel tempted to write an overly smart grammar without any keywords because it should look and feel like natural language, you should really reconsider your approach. Even though I don't want to encourage a keyword-hell, keywords are somehow convenient if they are used properly. And please note that things like curly braces, parentheses or other symbols with only one character are as good as a keywords as other, longer sequences - at least from the parsers perspective. So to give a very popular example: Instead of using indentation to describe the structure of your language (similar to Python), using a c-style notations may save you a lot of effort with the grammar itself and provide a better user experience when editing code. And keywords also serve as a nice visual anchor in an editor so users will have an easier time when reading code in your language.

A second strategy to improve the behavior of the parser and the chance for nice error recovery is the auto-edit feature. It may have some flaws but it's quite essential for a good user experience. The most important aspect here is the insertion of closing quotes for strings and comments. As soon as you have an input sequence that is not only broken for the parser but lets even the lexer choke, you are basically screwed. Therefore multiline comments and strings are automatically closed as soon as you open them. If you use custom terminal rules, you should really consider to look for unmatched characters that should be inserted in pairs according to the lexer definition. The rule basically applies for paired parentheses, too. Even though the concrete auto-edit features may still need some fine tuning to not get in the way of the user, they already greatly improve the error recovery of the parser.

Friday, November 2, 2012

Xtext Corner #6 - Data Types, Terminals, Why Should I Care?

Issue #6 of the Xtext Corner is about some particularities of the parsing process in Xtext. As I already mentioned a few times in the past, Xtext uses Antlr under the hood in order to do the pure parsing. This is basically a two step process: At first, the input sequence of characters is split into tokens (often referred to as terminals) by a component that is called the lexer. The second step is to process the resulting list of tokens. The actual parser is responsible for that stage. It will create the abstract syntax tree from the token stream.

This divide-and-conquer approach is mostly called parsing altogether so the distinction between lexing and parsing is quite encapsulated. Nevertheless, the Xtext grammar definition honors both aspects: It is possible to define (parser) rules that are processed by the parser and it is also possible to define terminal rules. Those will be handled in the lexer. So the when should I use parser rules and when should I use terminal rules?

Production Rules also: Parser Rules

The obvious case and just for the sake of completeness: Production rules will yield an instance in the abstract syntax tree. These can only be implemented by the parser thus there is no question whether to use terminals instead. Production rules are the most common rules in almost every Xtext grammar.

Data Type Rules

Those are a completely different thing even though they are handled by the parser, too: Where ordinary parser rules produce instances of EClasses, data type rules will return data types (you did not guess that, did you?). Data types in the sense of Xtext and its usage of the Eclipse Modeling Framework are basically primitive Java types, Strings or other commons like BigDecimal or enums. The parser will not create those on its own but rather pass the consumed tokens as a string to a value converter. The language developer is responsible for converting the string to a data type.

Terminal Rules

Terminal rules are essentially the same as data type rules when you only consider the interface to the grammar. Internally they are completely different since they will not be processed by the parser but by the lexer. The consequences are quite severe if you want to get a working grammar. But one step after the other: As already mentioned, terminal rules can return the very same things as data type rules can. That is, they yield Strings, ints or other primitives. But since they are handled by the lexer, they are not quite as powerful as data type rules are.

Implementation aspects

The lexer is a pretty dumb component which is generated in a way that weights performance over intuitive behavior. Where the parser generator will produce nice error messages in case of ambiguous data types rules, conflicting terminals are mostly resolved by a first-come-first-served (FCFS, also FIFO) principle. For terminals it's crucial to get them in the right order. Consider the following terminal rules:

terminal ID: ('a'..'z') ('a'..'z'|'0'..'9')*;
terminal CHARS: ('a'..'z'|'_')+;

The ID rule shall consume something that starts with a lowercase letter and is followed by a letter or number. The CHARS rule is pretty close to that one: It shall match a sequence that contains only lowercase letters or underscores. The problem with these is that the matched sequences are not mutually exclusive. If you take the input abc as an example, it will be matched as an ID given the two rules above. As soon as you switch the order of the declarations, the sequence abc will out of a sudden be returned as CHARS. That's one thing that you have to be a aware of if you use terminal rules. But there is more to keep in mind.

Terminal rules are applied without any contextual information which has some interesting implications. The plus: it's easily possible to use the lexer on partial input - entry points can be computed almost trivially. There is no such thing as an entry rule as for the parser. But the disadvantages have to be taken into account, too. The lexer does not care about the characters that are still to come in the input sequence. Everything that matches will be consumed. And not reverted (as of Antlr 3.2). To explain what that means, let's take another example:

terminal DECIMAL: INT '.' INT;
terminal INT: '0'..'9'+;

The rules define decimal numbers in a hypothetical language that also supports method invocation. At a first glance, things seem to work fine: 123 is consumed as an INT where 123.456 will be a decimal. But the devil's in the details. Let's try to parse the string 123.toString(). The lexer will find an INT 123 - so far, so good. Now it sees a dot which is expected by the terminal rule DECIMAL. The lexer will consume the dot and try to read an INT afterwards - which is not present. Now it'll simply fail for the DECIMAL rule but never revert that dot character which was consumed almost by accident. The lexer will create an invalid token sequence for the parser and the method call cannot be read successfully. That's because the lexer simply does not know about things like the expectation in the current parser state. Attempts to define decimals, qualified names or other more complex strings like dates are very error prone and can often be implemented quite easy by means of data type rules:

DECIMAL: INT '.' INT;
terminal INT: '0'..'9'+;

Terminals Considered Harmful?

Data type rules move the decision from the ~~dumb~~ highly optimized lexer to the parser which has a lot more information at hand (the so called look ahead) to make decisions. So why not use data types everywhere? The simple answer is: performance. The duo of lexer and parser is optimized for a stream of reasonable sized tokens instead of hundreds of single characters. Things will not work out that well with Antlr at run-time if the parser is implemented in a so called scanner-less manner. The rule of thumb here is to use only a few number of terminal rules that can be distinguished easily and put data type rules on top of those. It'll simplify your live as a language developer tremendously.

Thursday, November 1, 2012

JUGFFM: A Scenic View, Ebblewoi and Very Nice People

Yesterday I had the opportunity to give a presentation about Xtend at the JUG Frankfurt. I really enjoyed it since the audience had a lot of very good questions and quite some interesting discussion unfolded from those during the talk and thereafter. Many thanks to Alex who organized the event.

The JUGF Stammtisch took place in the German National Library which is such an amazing location. We were in a room on the upper floor and the nightly view on the skyline of Frankfurt was almost paralyzing - I even forgot to take a picture... For the informal Stammtisch after the talk we changed location to a secret Franconian Apfelwein Schenke whose coordinates may not be disclosed. According to the locals, it's one of the last resorts in Frankfurt that's still rather free from tourists (except for the Kieler guy who will probably never manage to pronounce Ebblewoi correctly).

Dank an @szarnekow für den interessanten Xtend-Vortrag bei der @jugffm. Laßt euch das Apfelkompott schmecken... ;-)
— Hameister (@Hameiste) October 31, 2012

In the pub, the discussions continued over cutlet with traditional Green Sauce, typical Franconian cider or beer, since only Hessian stomachs can handle proper amounts of Ebblewoi. Unfortunately I had to leave at 10pm since I had to catch the receptionist in the hotel (thanks for the short briefing before I left, Apple maps indeed tried to play tricks on me...).

To put a long story short: The JUGF is a really nice crowd and I enjoyed the company a lot. Their next meeting is already on 07 Nov 2012. If you are in Frankfurt next week, make sure to stop by if you want to discuss DevOps topics.