Incremental reparsing and rehighlighting


Hello :-)

I have implemented a lexer to use in my `parserDefinition` and `syntaxHighlighterFactory` extension points. It isn't a state-based lexer. It is considerably more sophisticated than your usual lexer, and it is rather slow, so it is vital to use incremental reparsing.

The lexer does recognize that there are certain points where it can be safely restarted in a "default"/zero state. 

So what the lexer does is claim that it is in state 0 at those points, and claim to be in some other, unique state at any other point. Unfortunately, the way the lexer is being used by the PsiBuilderImpl (for parsing) and the LexerEditorHighlighter (for Highlighting) are not quite how I expected.

PsiBuilderImpl doesn't seem to do incremental relexing at all! It just asks for the whole document to be relexed every time. Am I missing something? Is it possible to get incremental relexing to happen for parsing without reimplementing (parts) of PsiBuilderImpl?

LexerEditorHighlighter does do somewhat incremental relexing, but it seems to have a problem knowing when to stop. I expected that if I insert a character, the highlighter will relex from the previous offset marked with the zero state. Indeed this seems to be happening.

But I would also expect that, if at some point it receives a token that is in the zero state, and that token was *already* in the zero state prior to me inserting the character, then the highlighter should *not* request any further tokens (obviously they will be all the same as before, since the tokenizer is in the same state in the same position of the file which is unchanged after that point). This is not what is happening.

What seems to be happening is that the tokenizer asks for a relexing of the text from the previous zero-state position up, up to the next zero-state position appearing at the beginning of a line. I assume that I need to change the LexerEditorHighlighter, is this correct?

Thank you for your attention,

Comment actions Permalink

Hello Bruno,

Lexer needs to be light-weight. Lexing happens very often. Trying to control it: when it runs, how much it parses - is a futile exercise.

If lexer runs too slowly, refactor it so it performs only very basic parsing.

The stage 2 (last stage) parsing : forming complex elements from simple ones (as output by lexer) - is task of a parser. Parser runs less often. Parser does not repeat lexer's tasks. Parser forms simple elements into a tree. Parser can look ahead and behind. It works with simple elements identified by lexer.

Hope this helps.

Comment actions Permalink

Hi Bruno,

Your observations are generally correct; also I could not disagree with Imants: Lexer should be ultra-fast to cope with the fast editing. One cannot guarantee that some "bad" change will not require re-lexing of the whole file 'till the end because all these "safe zero marks" will shift somehow. I understand your proposal of relexing some "minimal" segment of the document and it seems logical to me. However, this is a very fundamental piece of IDE and the profit of introducing such change is not so obvious — according to our experience lexing is not a performance bottleneck by any means.

> PsiBuilderImpl doesn't seem to do incremental relexing at all! It just asks for the whole document to be relexed every time. Am I missing something? Is it possible to get incremental relexing to happen for parsing without reimplementing (parts) of PsiBuilderImpl?

As Imants already stated, parser runs quite seldom: you can expect smth around "once per 300ms". So complete re-lexing should not matter much in that case, too.

Try to simplify your lexer if that's possible. I do myself have very complex context-dependent lexing in Markdown plugin, however after some tricks it works quite smoothly so I think there is way to solve everything without hacking the platform :)


Comment actions Permalink

I get what you're saying guys, I really do.

I *am* using IJ to do something it wasn't really meant to do: I have an external program to do the parsing. This is a necessity since the programming language I'm writing a plugin for can itself be used to change the behaviour of the lexer and parser (think Lisp). While I could, with lots of work, rewrite the lexing algorithm in Java, without actually asking the compiler to do it I cannot hope to lex/parse custom language constructs.

So in actuality what the IJ "lexer" does is invoke the *parser* and ask it to split the file into a bunch of BEGIN/END tokens. These are then used to set the markers during PSI building.

The compiler is written in python, so it isn't the speediest creature it could be. Processing a 300 line file takes about 250ms. As It stands, a single new character seems to relex the whole file twice, and part of the file a few times.

It seems that in order to get the correct re-highlighting behaviour for my language I need to use the editorhighlighter extension point and implement some variant of LexerEditorHighlighter. This is now clearer in my head, thanks.

I think that in order to get a faster parsing, I will cache the results of lexing and update them dynamically when the text changes. Some work lies ahead :-)

Thank you for your time,

Comment actions Permalink

Yes, caching parsing results per text is a good idea. However, with 250ms lexing it is likely to be barely usable.

You can try to workaround this in faster yet still awful ways — for example, consider every character as a token. Next step could be joining same whitespaces and "words" since they are very rarely represent different tokens. And this could be enough — as long as lexer-side partition of the file is a superset of parser-side partitioning, you're good.
By the way, there is no need to implement lexer in Java. Our common practice is to use JFlex for that. The code is very short for simple lexers.

With such workarounds you won't have immediate highlighting, but you will have fast typing which is necessary.

Hope this helps :)



Please sign in to leave a comment.