Virtual/phantom tokens in lexer output

I've been working on a language plugin, and once I started actually trying to use it myself, I noticed there seems to be a big problem somewhere between the lexer and the parser. Specifically, I'm working on a Haskell plugin, and that language has an interesting quirk: braces are required for program correctness, but programmers don't usually write them. Instead, source files are effectively preprocessed according to an "offside" (or "layout") rule that inserts braces in the appropriate spots, based on the newlines/indentation that follow certain brace-requiring keywords. I implemented this in my plugin with what is effectively a facade that sits between the lexer and its accessors, so that when the class using the lexer asks for the next token, it might actually get a "virtual" brace if that's what the layout rule requires at that point in the source file.
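
Concretely, the facade is shaped roughly like this. This is a simplified sketch rather than my actual code: the layout bookkeeping is elided, it only handles a single pending virtual token, and the names are stand-ins.

import com.intellij.lexer.Lexer;
import com.intellij.lexer.LexerBase;
import com.intellij.psi.tree.IElementType;

// Wraps the real lexer; when the layout rule calls for a brace, reports a
// zero-width "virtual" token before handing back the underlying token.
public class LayoutLexer extends LexerBase {
  // Stand-in token type; the real plugin would define this alongside its
  // other Haskell token types.
  private static final IElementType VIRTUAL_LBRACE = new IElementType("VIRTUAL_LBRACE", null);

  private final Lexer delegate;
  private IElementType pendingVirtual; // virtual token to emit before the next real one

  public LayoutLexer(Lexer delegate) {
    this.delegate = delegate;
  }

  @Override
  public void start(CharSequence buffer, int startOffset, int endOffset, int initialState) {
    delegate.start(buffer, startOffset, endOffset, initialState);
    pendingVirtual = null;
  }

  @Override
  public IElementType getTokenType() {
    return pendingVirtual != null ? pendingVirtual : delegate.getTokenType();
  }

  // A virtual token is zero-width: it starts and ends at the offset of the
  // real token it precedes.
  @Override
  public int getTokenStart() {
    return delegate.getTokenStart();
  }

  @Override
  public int getTokenEnd() {
    return pendingVirtual != null ? delegate.getTokenStart() : delegate.getTokenEnd();
  }

  @Override
  public void advance() {
    if (pendingVirtual != null) {
      pendingVirtual = null; // the real token at this offset comes next
      return;
    }
    delegate.advance();
    if (layoutRuleRequiresBraceHere()) { // hypothetical layout bookkeeping
      pendingVirtual = VIRTUAL_LBRACE;
    }
  }

  @Override
  public int getState() {
    return delegate.getState();
  }

  @Override
  public CharSequence getBufferSequence() {
    return delegate.getBufferSequence();
  }

  @Override
  public int getBufferEnd() {
    return delegate.getBufferEnd();
  }

  private boolean layoutRuleRequiresBraceHere() {
    return false; // indentation-stack logic elided
  }
}

Note that getState() just forwards the delegate's state and says nothing about pendingVirtual or the layout stack, which is precisely the context an incremental restart in the middle of the file would need.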

This all works great for unit tests and every crazy source file I've thrown at it, at least for the initial parse. The trouble starts when I edit an existing file: I'll quickly see my PSI tree fall apart in the viewer after the edit point, and I'll see an error before the edit point saying, for example, "} expected", signifying, obviously, that the parser (which knows nothing about the layout rule) needed a closing brace but didn't get one from the lexer. While I can't rule out some insidious bug in my own code, I suspect this actually has something to do with the incremental re-lexing that the IntelliJ framework appears to perform. Given the nature of the layout rule (or at least my implementation of it), re-lexing a small portion of the file without the surrounding context seems doomed to fail.

So first question: are there any general guidelines around virtual tokens like this? Even if they don't cover my specific situation, such guidelines might point me in the right direction. Beyond that, are there any suggestions for how I might go about fixing this? I can think of a few things to try:

1. Prevent incremental re-lexing, either by hackery or by configuration. I might be willing to consider this path temporarily if it lets me continue putting the plugin through its paces and designing features, but the performance implications are likely too dire for this to be a serious option.

2. The lexer interface defines a way to get the current lexer state and a method to restart from a given state. Presumably IntelliJ keeps track of this state number and its association with tokens/text ranges so that it can restore the state when restarting the lexer in the middle of the file. If that is indeed how it works, then in theory some sort of lexer state manager could keep track of the layout stack at every point in the file and then look up the right stack context for a given state number (a rough sketch of what this might look like follows this list). This seems pretty nightmarish and fraught with error, and probably a massive overload of the meaning of the lexer state, which, from the few examples I've been able to find, seems more geared towards "I'm inside a string right now".

3. Some other nifty lexer feature that solves all my problems which I have been too blind to see yet.
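
For what it's worth, here is roughly what I picture option 2 collapsing into, building on the sketch above. Again, only a sketch: it assumes the layout context can be summarized as a small integer, and a real layout stack (indentation columns, not just a depth) cannot be.

import com.intellij.lexer.Lexer;
import java.util.ArrayDeque;
import java.util.Deque;

// Folds a crude summary of the layout stack into the int returned by
// getState(), so that a restart in mid-file can rebuild some layout context.
public class StatefulLayoutLexer extends LayoutLexer {
  private final Deque<Integer> layoutStack = new ArrayDeque<>(); // pushes/pops elided

  public StatefulLayoutLexer(Lexer delegate) {
    super(delegate);
  }

  @Override
  public int getState() {
    // Low 16 bits: the delegate's own state; high bits: layout stack depth.
    return (layoutStack.size() << 16) | (super.getState() & 0xFFFF);
  }

  @Override
  public void start(CharSequence buffer, int startOffset, int endOffset, int initialState) {
    super.start(buffer, startOffset, endOffset, initialState & 0xFFFF);
    layoutStack.clear();
    for (int i = initialState >>> 16; i > 0; i--) {
      layoutStack.push(0); // the indentation columns are lost; that is the core weakness
    }
  }
}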

The other possibility I can see is learning more about how the lexer and parser interact. Specifically, when and how is the PSI tree refreshed? If the parse tree were always fully regenerated after the editor input was re-lexed, it would seem very feasible to put the layout-processing shim in front of the parser as a pre-processing step.

Thanks for any suggestions and for taking the time to read this long-winded post.

9 comments
---

I don't know for sure that this is the right answer, but have you considered using the TokenFilter support in PsiBuilder?

Oh, and it parses the whole file unless you've created a parser that supports incremental reparsing or lazy parsing. It just merges the changes from the reparse into your existing PSI tree.

---

Been doing some more digging, and it seems like the virtual tokens as currently implemented just aren't going to work with IntelliJ. There are several places in the OpenAPI that explicitly check for, and throw on, cases where the current token's start position and state match the previous token's start and state. I also know from previous experiments that other parts get very angry when you mess with the lexemes in other ways. So I'm going to try moving the layout processing into the parser.

---

Interesting idea, Jon, thanks. I just looked, and I think the filter you're referring to is the ITokenTypeRemapper? At the moment I can't think of a case where that particular interface would be useful, but I don't care, because your suggestion gave me a different idea: why not encode the virtual token information directly into my IElementType implementation? Thanks a lot! I'll report back my success or failure (but I'm pretty confident this will work, and it requires a lot less work than ripping apart the lexer to boot!).
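
The rough idea, sketched with assumed names (real code would pass HaskellLanguage.INSTANCE instead of null, and these would be a small interned set of token types, not per-occurrence instances):

import com.intellij.psi.tree.IElementType;
import org.jetbrains.annotations.NotNull;

// A token type that records the layout decision itself: the lexer returns
// ordinary tokens, but the type tells the parser that the layout rule
// opens or closes a block immediately before this token.
public class LayoutAwareTokenType extends IElementType {
  private final boolean opensBlock;  // virtual '{' implied before this token
  private final boolean closesBlock; // virtual '}' implied before this token

  public LayoutAwareTokenType(@NotNull String debugName, boolean opensBlock, boolean closesBlock) {
    super(debugName, null); // real code: HaskellLanguage.INSTANCE
    this.opensBlock = opensBlock;
    this.closesBlock = closesBlock;
  }

  public boolean opensBlock() {
    return opensBlock;
  }

  public boolean closesBlock() {
    return closesBlock;
  }
}

The parser can then check whether PsiBuilder.getTokenType() is one of these and open or close its markers accordingly, without the lexer ever emitting a separate lexeme.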

---

That seems like a better idea. I think there is probably an assumption that a lexeme will have a nonzero length. What is the difference between inserting a virtual token and just inferring its existence in the parser?

---

Cool. Good luck!

---

It is true that the start of the next token must match the end of the previous token, but there's nothing that prevents you from using zero-length tokens, as long as they have the correct offset. Actually that's what the Python plugin uses for handling Python indentation, and it works just fine.
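
To make the invariant concrete, this is the kind of check one could run over a lexer in a unit test (a sketch, not platform code): consecutive tokens must tile the text exactly, and a zero-length token satisfies that as long as its offset is right.

import com.intellij.lexer.Lexer;

public final class LexerInvariants {
  // Walks the whole text and verifies that each token starts exactly where
  // the previous one ended; zero-width tokens simply leave the expected
  // offset unchanged.
  public static void assertTokensTile(Lexer lexer, CharSequence text) {
    lexer.start(text, 0, text.length(), 0);
    int expectedStart = 0;
    while (lexer.getTokenType() != null) {
      if (lexer.getTokenStart() != expectedStart) {
        throw new AssertionError("gap or overlap at offset " + expectedStart);
      }
      expectedStart = lexer.getTokenEnd();
      lexer.advance(); // must eventually make progress, or this loops forever
    }
    if (expectedStart != text.length()) {
      throw new AssertionError("lexer stopped before the end of the text");
    }
  }
}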

---

Thanks Dmitry, it's very helpful to know that the Python plugin does something similar to what I was doing (I guess the Python plugin is not open source, though? :)). I may try that again if my current path doesn't work out (it's mostly working so far).

That said, does the Python plugin use a different lexer for highlighting vs parsing? The instance I can think of off the top of my head where there was a problem with zero-length tokens was in the LexerEditorHighlighter (line 188 in my version of the sources):

if (tokenStart == lastTokenStart && lexerState == lastLexerState) {
  throw new IllegalStateException("Error while updating lexer: " + e + " document text: " + document.getText());
}

I could have sworn there was at least one other case that did a similar check, since I did use a lexer that didn't output the layout tokens for highlighting. Anyway, thanks; this lets me know that I'm probably doing something wrong somewhere, so I just need to figure out what that is.

---

Yes, Python uses different lexers for parsing and highlighting; the one used for highlighting does not generate synthetic tokens.
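
In plugin terms the split looks something like this (a sketch; HaskellLexer and the attribute mapping are stand-ins):

import com.intellij.lexer.Lexer;
import com.intellij.openapi.editor.colors.TextAttributesKey;
import com.intellij.openapi.fileTypes.SyntaxHighlighterBase;
import com.intellij.psi.tree.IElementType;
import org.jetbrains.annotations.NotNull;

// The highlighter gets the raw lexer, which never emits virtual braces;
// the ParserDefinition would hand the layout-aware wrapper to the parser
// instead, e.g. return new LayoutLexer(new HaskellLexer());
public class HaskellSyntaxHighlighter extends SyntaxHighlighterBase {
  @NotNull
  @Override
  public Lexer getHighlightingLexer() {
    return new HaskellLexer(); // raw lexer, no synthetic tokens
  }

  @NotNull
  @Override
  public TextAttributesKey[] getTokenHighlights(IElementType tokenType) {
    return pack(attributesFor(tokenType)); // pack() tolerates null
  }

  private static TextAttributesKey attributesFor(IElementType tokenType) {
    return null; // keyword/string/comment mapping elided
  }
}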

---

Sorry if I led you astray. Now that I know that you can do this, I can think of a few ways it might be useful in my plugin as well. Thanks for this thread!
