How to implement Lexer.start/FlexLexer.reset when the lexer carries a stack of states

Answered

Created January 17, 2021 20:05

I have a lexer which uses a stack to manage its states. I think this is a common pattern, and it is also suggested in https://intellij-support.jetbrains.com/hc/en-us/community/posts/360000594004/comments/360000196484. Unfortunately, this seems to break due to the following limitation.

An essential requirement for a syntax highlighting lexer is that its state must be represented by a single integer number returned from Lexer.getState(). That state will be passed to the Lexer.start() method, along with the start offset of the fragment to process, when lexing is resumed from the middle of a file. -- https://plugins.jetbrains.com/docs/intellij/implementing-lexer.html

I cannot represent a stack of states with one integer. I looked up implementations of other language plugins, but I couldn't find a good solution.

The RegExpLexer of IntelliJ and the elm-plugin (https://github.com/durkiewicz/elm-plugin) seem to ignore the requirement and don't do anything to restore the full state when `Lexer.start` is called.

The Dart plugin deletes any remaining data of previous executions from the stack. I guess this at least ensures deterministic behavior, but the state is still not restored correctly. https://github.com/JetBrains/intellij-plugins/blob/master/Dart/src/com/jetbrains/lang/dart/lexer/DartLexer.java

I found only one solution which actually seems to restore the correct state: the IntelliJ adaptor provided by ANTLR 4. However, ANTLR's solution is more complicated, and I fear that its implementation might cause a memory leak. https://github.com/antlr/antlr4-intellij-adaptor/blob/master/src/main/java/org/antlr/intellij/adaptor/lexer/ANTLRLexerAdaptor.java

I would like to know how language plugins are supposed to work with this limitation. Most plugins just don't seem to restore the correct state. Is that a problem or an intentional compromise? Maybe a "good guess" for the correct highlighting is good enough? Does the syntax highlighter revalidate the whole file anyway a few seconds after the user stops typing?

5 comments

Johannes Spangenberg

Created January 31, 2021 17:26

Here is my current implementation which I might end up using. It actually fixed some strange behavior I have noticed in syntax highlighting. Not sure if this is a good solution, though.

While debugging, I noticed that IDEA never called reset with `initialState != 0`. Maybe IDEA only resets the lexer to positions where Lexer.getState() has returned 0?

User code in `*.flex` file:

// Every distinct stack seen so far gets an index into `states`.
// Each entry packs (index of the parent stack << 32) | JFlex state.
private final TLongArrayList states = new TLongArrayList();
private final TLongIntHashMap stateIndexMap = new TLongIntHashMap();

{
  // Index 0 stands for the empty stack with the lexer in YYINITIAL.
  states.add(YYINITIAL);
  stateIndexMap.put(YYINITIAL, 0);
}

private int currentStateIndex = 0;
private int parentStateIndex = 0;

// The single integer that is reported to the IDE as the lexer state.
public int getStateIndex() {
  return currentStateIndex;
}

// Unpacks the entry for the given index and switches JFlex to its state.
public void restoreState(int stateIndex) {
  long state = states.get(stateIndex);
  currentStateIndex = stateIndex;
  parentStateIndex = (int) (state >> 32);
  yybegin((int) state);
}

private void pushState(int yystate) {
  long state = ((long) currentStateIndex << 32) | ((long) yystate & 0x0FFFFFFFFL);
  int stateIndex = stateIndexMap.get(state); // Returns 0 if not found
  if (stateIndex == 0 && state != YYINITIAL) {
    // First time this exact stack is seen: register it under a new index.
    stateIndex = states.size();
    states.add(state);
    stateIndexMap.put(state, stateIndex);
  }
  parentStateIndex = currentStateIndex;
  currentStateIndex = stateIndex;
  yybegin(yystate);
}

// Returns to the stack the current one was pushed onto.
private void popState() {
  restoreState(parentStateIndex);
}

Adapter implementation:

public class MyLexer extends FlexAdapter {
    public MyLexer() {
        super(new FlexLexer() {
            private final _MyLexer lexer = new _MyLexer(null);

            @Override
            public void yybegin(int state) {
                lexer.restoreState(state);
            }

            @Override
            public int yystate() {
                return lexer.getStateIndex();
            }

            @Override
            public int getTokenStart() {
                return lexer.getTokenStart();
            }

            @Override
            public int getTokenEnd() {
                return lexer.getTokenEnd();
            }

            @Override
            public IElementType advance() throws IOException {
                return lexer.advance();
            }

            @Override
            public void reset(CharSequence buf, int start, int end, int initialState) {
                lexer.reset(buf, start, end, initialState);
                lexer.restoreState(initialState);
            }
        });
    }
}
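
For anyone who wants to try the idea without JFlex or Trove, here is a minimal stand-alone sketch of the same index scheme in plain Java. The class `StackStateInterner` and its method names are made up for this illustration and are not part of JFlex or the IntelliJ API. It only models the bookkeeping: every distinct stack gets an int index, and that index is what would be returned as the lexer state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of the index scheme above; not a real lexer. */
class StackStateInterner {
    static final int YYINITIAL = 0;

    // states.get(i) packs (parent index << 32) | flex state for stack number i.
    private final List<Long> states = new ArrayList<>();
    private final Map<Long, Integer> indexOf = new HashMap<>();
    private int current = 0;

    StackStateInterner() {
        // Index 0 is the empty stack with the lexer in YYINITIAL.
        states.add(0L);
        indexOf.put(0L, 0);
    }

    /** The single integer that would be reported as the lexer state. */
    int currentIndex() {
        return current;
    }

    /** The flex state on top of the current stack (low 32 bits of the entry). */
    int flexState() {
        return states.get(current).intValue();
    }

    /** Enters flexState on top of the current stack, reusing a known index if possible. */
    void push(int flexState) {
        long packed = ((long) current << 32) | (flexState & 0xFFFFFFFFL);
        Integer known = indexOf.get(packed);
        if (known == null) {
            known = states.size();
            states.add(packed);
            indexOf.put(packed, known);
        }
        current = known;
    }

    /** Returns to the parent stack (high 32 bits of the entry). */
    void pop() {
        current = (int) (states.get(current) >> 32);
    }

    /** Resumes from an index that was reported earlier. */
    void restore(int index) {
        current = index;
    }
}
```

Note that the table in this sketch only grows. It is bounded by the number of distinct stacks the lexer has ever seen, so whether that is acceptable depends on the language.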

Johannes Spangenberg

Created March 14, 2021 13:47

I found this commit in LexerEditorHighlighter which mentions that you can use LexerTestCase.checkCorrectRestart to test if your lexer is working correctly. The comment also mentions LayeredLexer, but I am not sure why and when it should be used.

Note that IntelliJ actually does not distinguish between different non-initial states by default. IntelliJ will only restart at tokens which have the initial state. You could therefore avoid my complex implementation from above and just return "stack.size()" as state, assuming your stack is initially empty. However, if your lexer does not return to its initial state for a long time, this can cause a sluggish user experience as described in the linked comment.
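
A minimal sketch of that simpler variant, assuming every state change goes through push and pop so that an empty stack always means `YYINITIAL`. The class `DepthStateLexerModel` and its methods are hypothetical and only model the bookkeeping, not a real JFlex lexer.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model: report the stack depth as the lexer state. */
class DepthStateLexerModel {
    static final int YYINITIAL = 0;

    private final Deque<Integer> stack = new ArrayDeque<>();
    private int flexState = YYINITIAL;

    void pushState(int next) {
        stack.push(flexState);
        flexState = next;
    }

    void popState() {
        // Popping an empty stack falls back to YYINITIAL in this sketch.
        flexState = stack.isEmpty() ? YYINITIAL : stack.pop();
    }

    int currentFlexState() {
        return flexState;
    }

    /** Value a getState() implementation could return: 0 only when nothing is stacked. */
    int reportedState() {
        return stack.size();
    }

    /** This sketch only supports restarting from state 0, so a reset drops the stack. */
    void reset() {
        stack.clear();
        flexState = YYINITIAL;
    }
}
```

Since the depth is 0 only for an empty stack, every position that reports 0 can be resumed with a fresh lexer.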

If you want to avoid such a sluggish user experience, it seems you must implement proper state handling as demonstrated in my last comment. Your lexer must then implement RestartableLexer (source), which tells IntelliJ that the lexer can be restarted at other states as well. Note that you should use LexerTestCase.checkCorrectRestartOnEveryToken instead of LexerTestCase.checkCorrectRestart for testing such a lexer. Unfortunately, this interface has experimental methods without default implementations, which means your plugin will break if JetBrains changes the experimental API of this interface.

Andrey Sokolov

Created July 14, 2021 13:16

Hey. Sorry for the late response. In general, you should try to implement your lexer so that it returns to YYINITIAL regularly. That's the simplest way to make the highlighter update incrementally.
Or you can try to implement RestartableLexer (implement isRestartableState() and make the start method of your lexer work correctly from a non-initial state) to support several restartable states. But you then have to restore the stack from one integer somehow.
For now, you can try to find the most commonly used stacks for your problematic states and encode each of them with a different integer value. Let's say the stack YYINITIAL(0), CODE_STATE(2), LITERAL_STATE_RESTARTABLE(4) is encoded as LITERAL_STATE_RESTARTABLE(4), and the stack YYINITIAL(0), CODE_STATE(2), SOME_DIFFERENT_STATE(8), LITERAL_STATE_NOT_RESTARTABLE(5) is encoded as LITERAL_STATE_NOT_RESTARTABLE(5). Then: 1) make isRestartableState() return true for LITERAL_STATE_RESTARTABLE(4); 2) handle starting your lexer from this specific state by properly restoring the stack (in this case YYINITIAL(0), CODE_STATE(2), LITERAL_STATE_RESTARTABLE(4) if you start from LITERAL_STATE_RESTARTABLE(4)).
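
A rough sketch of how such a table of known stacks could look, using the state numbers from the example above. The class `KnownStackRestorer` and its static helpers are hypothetical. In a real plugin this logic would sit behind isRestartableState() and the start method of the lexer, which are not shown here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

/** Hypothetical lookup from a restartable state to the stack that is assumed beneath it. */
class KnownStackRestorer {
    static final int YYINITIAL = 0;
    static final int CODE_STATE = 2;
    static final int LITERAL_STATE_RESTARTABLE = 4;

    // For each restartable state: the enclosing states, outermost first.
    private static final Map<Integer, int[]> ENCLOSING = Map.of(
            YYINITIAL, new int[0],
            LITERAL_STATE_RESTARTABLE, new int[] {YYINITIAL, CODE_STATE});

    /** Only states with a known stack may be used as restart points. */
    static boolean isRestartableState(int state) {
        return ENCLOSING.containsKey(state);
    }

    /** Rebuilds the stack of enclosing states; the innermost one ends up on top. */
    static Deque<Integer> restoreStack(int state) {
        int[] enclosing = ENCLOSING.get(state);
        if (enclosing == null) {
            throw new IllegalArgumentException("not a restartable state: " + state);
        }
        Deque<Integer> stack = new ArrayDeque<>();
        for (int s : enclosing) {
            stack.push(s);
        }
        return stack;
    }
}
```

This only works if a restartable state can be reached through exactly one stack. A state that can occur under several different stacks has to stay non-restartable or be split into separate states.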

Joachim Ansorg

Created March 21, 2023 16:27

Late reply, but hopefully still helpful.

https://github.com/JetBrains/intellij-community/blob/master/plugins/sh/core/src/com/intellij/sh/lexer/Sh.flex is a good example of state management with a Flex-based lexer. You basically use JFlex's `int` state as state, which is returned as the IDE lexer's state. I don't see a reason to keep states as `long` in the current implementation of the Nix lexer.

Unless you're implementing a complex lexer with `RestartableLexer`, just return `0` to make a lexer restartable (`YYINITIAL` is `0`).

Johannes Spangenberg

Created March 22, 2023 20:58

Joachim Ansorg, thanks for your reply. As long as you are not stacking your states, you are right. When you are not using `RestartableLexer`, you may even use a stack as long as you are not calling `pop` while in state `YYINITIAL`.

In my case, I took the previous Flex state from the stack whenever I read `}`. The state integer of Flex alone wasn't enough; I had to encode the stack into the overall state of my lexer. Anyway, I could have avoided the complexity by just not using `YYINITIAL` as long as my stack is not empty (assuming I am not implementing `RestartableLexer`).
