In today's issue of the Xtext Corner, I want to discuss the library approach and compare it to some hard coded grammar bits and pieces. The question about which path to choose often arises if you want to implement an IDE for an existing language. Most languages use a run-time environment that exposes some implicit API.
Just to name a few examples: Java includes the JDK with all its classes and the virtual machine has a notion of primitive types (as a bonus). JavaScript code usually has access to a DOM including its properties and functions. The DOM is provided by the run-time environment that executes the script. SQL in turn has built-in functions like
max,
avg or
sum. All these things or more or less an integral part of the existing language.
As soon as you start to work on an IDE for such a language, you may feel tempted to wire parts of the environment into the grammar. After all, keywords like
int,
boolean or
double feel quite natural in a Java grammar - at least a first glance. In the long run it often turns out to be a bad idea to wire these things into the grammar definition. The alternative is to use a so called library approach: The information about the language run-time is encoded in an external model that is accessible to the language implementation.
An Example
To use again the Java example (and for the last time in this post): The ultimate goal is to treat types like
java.lang.Object and
java.util.List in the same way as
int or
boolean. Since we did this already for Java as
part of the Xtext core framework, let's use a different, somehow artificial example in the following. Our dummy language supports function calls of which
max,
min and
avg are implicitly available.
The hard-coded approach looks quite simple at first. A simplified view on the things will lead to the conclusion that the parser will automatically check that the invoked functions actually exist, content assist works out of the box and even the coloring of keywords suggests that the three enumerated functions are somehow special.
Not so obvious are the pain-points (which come for free): The documentation for these functions has to be hooked up manually, the complete signatures of them have to be hard-coded, too. The validation has to be aware of the parameters and return types in order to check the conformance with the actual arguments. Things become rather messy beyond the first quickly sketched grammar snippet. And last but not least there is no guarantee that the set of implicit functions is stable forever with each and every version of the run-time. If the language inventor introduces a new function
sum in a subsequent release, everything has to be rebuild and deployed. And you can be sure that the to-be-introduced keyword
sum will cause trouble at least in one of the existing files.
Libraries Instead of Keywords
The library approach seems to be more difficult at first but it pays off quickly. Instead of using hard-coded function names, the grammar uses only a cross reference to the actual function. The function itself is modeled in another resource that is also deployed with the language.

This external definition of the built-in functions can usually follow the same guidelines as custom functions do. But of course they may even use a simpler representation. Such a stub may only define the signature and some documentation comment but not the actual implementation body. It's actualy pretty similar to header files. As long as there is no existing format that can be used transparently, it's often the easiest way to define an Xtext language for a custom stub format. The API description should use the same EPackage as the full implementation of the language. This ensures that the built-ins and the custom functions follow the same rules and all the utilities like the type checker and documentation provider can be used independently from the concrete invoked function.
If there is an existing specification of the implicit features available, that one should be used instead. Creating a model from an existing, processable format is straight forward and it avoids mistakes because there is no redundant declaration of the very same information. In both cases there is a clear separation of concerns: The grammar remains what it should be: a description of the concrete syntax and not something that is tight to the run-time. The API specification is concise and easy to grasp, too. And in case an existing format can be used for that purpose, it's likely that the language users are already familiar with that format.
Wrap Up
You should always consider to use external descriptions or header stubs of the environment. A grammar that is tightly coupled to a particular version or API is quite error-prone and fragile. Any evolution of the run-time will lead to grammar changes which will in turn lead to broken existing models (that's a promise). Last but not least, the effort for a seamless integration of built-in and custom functions for the end-user exceeds the efforts for a clean separation of concerns by far.
A very sophisticated implementation of this approach, can be explored in the
Geppetto
repository at GitHub. Geppetto uses puppet files and ruby libraries as
the target platform, parses them and puts them onto the scope of the
project files. This example underlines another advantage of the library
approach: It is possible to use a configurable environment. The APIs may
be different from version to version and the concrete variant can be
chosen by the user. This would never be possible with a hard-wired set
of built-ins.