Using ANTLR (v4) to lex/parse custom file formats

Hello everyone!

I'm the ANTLR creator and am trying my hand at creating an ANTLR v4 plug-in for IntelliJ. My co-author, Sam Harwell, has created one for NetBeans, but IntelliJ is my preference. Ultimately, I would like to build plug-ins for languages using ANTLR in IntelliJ, rather than having to convert ANTLR grammars to the PEG-based BNF grammars used in GrammarKit. For one, I'm the ANTLR guy; for another, there is no point in maintaining two language descriptions. It seems to me that the community could be well served by the ability to specify lexers and parsers in ANTLR's metalanguage instead of building parsers by hand or using GrammarKit. GrammarKit might be fine for syntax highlighting, but if you're actually going to do anything with the language, you need a compiler or interpreter, which means you need a real parser. That means creating a grammar in something other than GrammarKit, even if only to build the proper compiler data structure instead of what IntelliJ needs. (Imagine incorporating a random C compiler: it already has a grammar, and you would then need to duplicate that grammar in GrammarKit for syntax highlighting.)

I've been going through the tutorials, reading source code, and reading through these discussions. It's fairly slow going trying to build my plug-in because of the scarcity of written documentation, including documentation within the library code, but some recent videos and some of the example plug-ins are illuminating. They're well worth going through. I appreciate the effort you've put into the documentation that is available.

I have a number of questions for the Galactic overlords from JetBrains:

* Would you consider it valuable if I made an adapter so that plug-ins could use ANTLR v4 grammars out-of-the-box? There was some discussion that it "is not recommended to use ANTLR for language plugins because most ANTLR grammars don't handle error recovery well [2008]", which is probably a bit in error, particularly with the latest v4. Possibly an earlier version consumed too much input while recovering, but ANTLR never gives up and exits the parser. Building great error recovery by hand is notoriously difficult. Because GrammarKit is PEG-based, it essentially has no error recovery. PEGs can only discover errors after the entire input has been matched; then they jump to the deepest position in the input and claim that's where the syntax error is. [The Xtext guys abandoned PEGs and came back to ANTLR after they experienced PEG "error recovery".] ;) I'm happy to discuss how to integrate ANTLR's error recovery into IntelliJ so that it performs as you would like. I'd like to build a command-line option on ANTLR that generates a basic IntelliJ plug-in automatically from a grammar, so I think this could contribute a lot to plug-in development.

* What are the minimum requirements of a parser? (The required lexer implementation seems reasonably straightforward from looking at the source code, but I've not actually tried a custom lexer.) The sample handwritten parsers seem to create PSI nodes via mark() / marker.done() at the start/end of parsing functions. From what I can tell, integrating ANTLR's generated recursive-descent parsers would require altering ANTLR code generation. That's no big deal, but perhaps the way to integrate ANTLR would be to avoid integrating the parser itself and instead wrap or convert its automatically generated parse trees into PSI elements. Does this approach make sense? NetBeans has a very simple lexer/parser interface, and cramming ANTLR in there is no big deal; it's just for syntax highlighting. Is there a similar lightweight approach where I can simply indicate where the syntax errors are, i.e., without having to build your special tree? In the end, do I need a PSI tree to access all of the cool widgets like source code navigators and search dialogs? Certainly for refactoring and so on I would need to use your trees...
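
To make the mark() / marker.done() pattern from the last question concrete, here is a minimal, self-contained sketch of a recursive-descent parser that opens a marker at the start of each parsing function and closes it at the end. The `Builder` and `Marker` classes, the `error` method, and the tiny bracketed-list grammar are all hypothetical stand-ins invented for illustration; they only imitate the shape of the pattern and are not IntelliJ's actual PsiBuilder API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: Builder and Marker are hypothetical stand-ins,
// not IntelliJ's PsiBuilder API.
class MarkerSketch {

    /** Records tokens, errors, and node boundaries in visit order. */
    static final class Builder {
        private final String[] tokens;
        private int pos = 0;
        // Each event is {kind, text}; kind is "open", "close", "token", or "error".
        private final List<String[]> events = new ArrayList<>();

        Builder(String... tokens) { this.tokens = tokens; }

        /** Current token text, or null at end of input. */
        String peek() { return pos < tokens.length ? tokens[pos] : null; }

        /** Consumes the current token and attaches it to the innermost open node. */
        void advance() { events.add(new String[] {"token", tokens[pos++]}); }

        /** Records a syntax error at the current position; parsing continues. */
        void error(String message) { events.add(new String[] {"error", message}); }

        /** Opens a node; its type stays unknown until done() is called. */
        Marker mark() {
            String[] open = {"open", null};
            events.add(open);
            return new Marker(this, open);
        }

        /** Renders the recorded tree as TYPE( children ). */
        String render() {
            StringBuilder sb = new StringBuilder();
            for (String[] e : events) {
                switch (e[0]) {
                    case "open":  sb.append(e[1]).append("( "); break;
                    case "close": sb.append(") "); break;
                    case "error": sb.append("<error: ").append(e[1]).append("> "); break;
                    default:      sb.append(e[1]).append(' ');
                }
            }
            return sb.toString().trim();
        }
    }

    static final class Marker {
        private final Builder builder;
        private final String[] open;

        Marker(Builder builder, String[] open) { this.builder = builder; this.open = open; }

        /** Closes the node and assigns the type chosen by the parsing function. */
        void done(String type) {
            open[1] = type;
            builder.events.add(new String[] {"close", type});
        }
    }

    // list : '[' item (',' item)* ']' ;
    static void parseList(Builder b) {
        Marker list = b.mark();              // start of the parsing function
        expect(b, "[");
        parseItem(b);
        while (",".equals(b.peek())) { b.advance(); parseItem(b); }
        expect(b, "]");
        list.done("LIST");                   // end of the parsing function
    }

    // item : IDENTIFIER ;
    static void parseItem(Builder b) {
        Marker item = b.mark();
        String t = b.peek();
        if (t != null && Character.isLetter(t.charAt(0))) b.advance();
        else b.error("expected identifier");
        item.done("ITEM");
    }

    static void expect(Builder b, String text) {
        if (text.equals(b.peek())) b.advance(); else b.error("expected " + text);
    }

    static String parse(String... tokens) {
        Builder b = new Builder(tokens);
        parseList(b);
        return b.render();
    }

    public static void main(String[] args) {
        System.out.println(parse("[", "a", ",", "b", "]")); // LIST( [ ITEM( a ) , ITEM( b ) ] )
        System.out.println(parse("[", "a", ",", "]"));
    }
}
```

The second call in `main` shows the lightweight error-reporting idea: the missing item is recorded as an error event inside an otherwise complete tree, and the parse keeps going rather than aborting.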

Sorry for the long initial post, but I wanted to start a discussion and learn how to proceed at the same time :) If there is desire from the community and I can get my roadblocks cleared, I'd be happy to keep working on this.

Thanks for your time,
Terence

Hi Terence,

First of all, I don't quite get your point about GrammarKit being good just for syntax highlighting. It lets you generate a completely fine parser, and we have a number of plugins (Dart, for example) that use GrammarKit-based parsers and are completely full-featured.

To answer your main question - yes, we would consider it extremely valuable to enable using ANTLR grammars in IntelliJ IDEA plugins, provided that a PSI tree can be generated based on the grammar. As for the exact strategy, both of the options you've mentioned are possible. Modifying ANTLR's generator to emit PsiBuilder.Marker calls would be the best solution from the performance point of view, but if you find it too cumbersome, you can simply generate the AST using the standard code generator and then walk over it to convert it to PsiBuilder markers (there are plugins out there that use this approach).
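
The second option, walking a finished tree and replaying it as marker calls, can be sketched roughly as follows. This is only an illustration of the idea: `Node` is a hypothetical stand-in for a parse tree produced by a generated parser, and `RecordingBuilder` is a hypothetical stand-in that merely logs the calls a real builder would receive; neither is the actual ANTLR or PsiBuilder API. The sketch also assumes that the leaves of the tree appear in the same order as the builder's own token stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: Node and RecordingBuilder are hypothetical stand-ins for
// a parse tree and a marker-based builder; neither is a real library API.
class TreeReplaySketch {

    /** A parse-tree node: either a rule with children or a single token. */
    static final class Node {
        final String label;          // rule name, or token text for a leaf
        final boolean isToken;
        final List<Node> children;

        private Node(String label, boolean isToken, List<Node> children) {
            this.label = label;
            this.isToken = isToken;
            this.children = children;
        }

        static Node rule(String name, Node... children) {
            return new Node(name, false, Arrays.asList(children));
        }

        static Node token(String text) {
            return new Node(text, true, new ArrayList<Node>());
        }
    }

    /** Logs mark / advance / done calls instead of building a real tree. */
    static final class RecordingBuilder {
        final List<String> log = new ArrayList<>();

        /** Opens a node and returns a handle (its index in the log). */
        int mark() { log.add("mark"); return log.size() - 1; }

        /** Consumes one token. */
        void advance(String tokenText) { log.add("advance " + tokenText); }

        /** Closes the node opened by the given handle and records its type. */
        void done(int marker, String type) {
            log.set(marker, "mark " + type);
            log.add("done " + type);
        }
    }

    /** Depth-first walk: mark before a rule's children, done after them. */
    static void replay(Node node, RecordingBuilder builder) {
        if (node.isToken) {
            builder.advance(node.label);
            return;
        }
        int marker = builder.mark();
        for (Node child : node.children) replay(child, builder);
        builder.done(marker, node.label);
    }

    /** Tree for "1 + 2" under a made-up grammar: expr : term '+' term ; */
    static Node sampleTree() {
        return Node.rule("expr",
                Node.rule("term", Node.token("1")),
                Node.token("+"),
                Node.rule("term", Node.token("2")));
    }

    public static void main(String[] args) {
        RecordingBuilder builder = new RecordingBuilder();
        replay(sampleTree(), builder);
        for (String line : builder.log) System.out.println(line);
    }
}
```

Because the walk is a plain pre-order/post-order traversal, the mark and done calls come out properly nested by construction, which is the property a marker-based builder relies on.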

Please don't hesitate to ask if you have any further questions.

Hi Dmitry, sorry. I should have been clearer. What I meant was that I'm sure GrammarKit is good for parsing, but when you need to integrate a parser into a compiler, translator, or interpreter, I am guessing it's not the best parser generator. Otherwise, it would have started to take market share from ANTLR ;). I'm sure it's excellent for validating syntax and so on within the IDE. I'm also curious to know how fast it is, given that Rats! had to do some extreme optimizations to get fast. For my most recent paper, we did a quick check with a (not our best) Java parser. It does about 325,000 lines/second, parsing the entire JDK 7 lib in 7s. (Not that speed is the most important thing, but for responsiveness in an IDE, I would imagine it's important; I'm guessing that you have not rewritten your Java parser to use GrammarKit.) But as you say, that's not the main point here, and I don't in any way mean to criticize GrammarKit. No doubt it is the best way for most people to integrate language support. My only hope is that, together, we can produce an automated grammar system that works well inside and outside of IntelliJ.

OK, great! As long as I know I will have the interest of JetBrains and the support necessary to answer questions, I will forge ahead. As I said, my goals are:

1. Learn how to build plug-ins for little languages
2. A simple ANTLR v4 plug-in
3. A -intellij option on ANTLR that will generate parsers/lexers for IntelliJ that work out-of-the-box. As you suggest, a simple tweak to the code generation might do it. The problem might be forcing all of the support code to generate PSI trees rather than ANTLR trees implementing ParseTree. If the PSI mechanism is purely interface-based, as I think it might be, it might work out.
3a. Possibly even generating the necessary plug-in software. There should be no reason I can't generate a basic plug-in automatically.

Thanks very much, Dmitry. You are very responsive and helpful. Thank you for that video posted on the blog.

I will get back to you with my primary questions soon. As a first step, I will try to create the core of a trivial handbuilt parser/lexer. Maybe even some text around it that you guys could use as a start on another bit of documentation.

Ter

A quick question on timing. I'm all the way out in California, so it looks like you're heading home just as I wake up! Dang. How late do you usually work (your time)? Are there other tech folks answering questions from other time zones?

thanks,
Ter

As for me personally, my working hours are 11-noon to 8-9 PM European time. As for the others, the entire dev team of IntelliJ IDEA is located either in Munich or in St. Petersburg, Russia.

Excellent. I can expect potential responses til noon ish my time then.
Ter

Hey Parr, any updates on this? I have generated lexers and parsers using ANTLR4 for a custom language, but I'm not sure how to hook them into IntelliJ.

Hi. I have been working hard on the plug-in and have learned a great deal about integrating parsers. I found it challenging to incorporate an ANTLR v4-generated parser into the built-in PSI tree and parsing mechanism, but I got it to work. Editors that view .g4 files use this mechanism; however, the live preview uses just a blank editor and ANTLR's parser interpreter. From there, I add my own error highlighting and that sort of thing, which was very easy. Naturally, it does not have all of the fancy automatic refactoring and things like that available to editors that fit within the infrastructure. I literally could not get the parser interpreter to play nicely with the PSI infrastructure. That is not the fault of the infrastructure; these are two completely different approaches that were never meant to work together. More like oil and water than peanut butter and jelly :)

The upshot of this is that I will be able to provide a small framework that lets you integrate an ANTLR grammar either with the interpreter or with the generated code. Ultimately, my goal is simply to have a button in the plug-in that says: make me an IntelliJ project for this grammar that you see before you. It should be easy, but it's not my first priority right now.

I really like the IntelliJ infrastructure; I'm getting good traction now and building sophisticated stuff. Wait until you see the parser profiler I'm about to release!

My suggestion is that you take a look at the plug-in source code:

https://github.com/antlr/intellij-plugin-v4

In particular, look at the InputPanel and PreviewPanel and ParsingUtils classes. I hope this helps in the meantime.

Ter
