How to implement Lexer.start/FlexLexer.reset when the lexer carries a stack of states

Answered

I have a lexer which uses a stack to manage its states. I think this is a common pattern, and it is also suggested in https://intellij-support.jetbrains.com/hc/en-us/community/posts/360000594004/comments/360000196484. Unfortunately, this seems to break due to the following limitation.

An essential requirement for a syntax highlighting lexer is that its state must be represented by a single integer number returned from Lexer.getState(). That state will be passed to the Lexer.start() method, along with the start offset of the fragment to process, when lexing is resumed from the middle of a file. -- https://plugins.jetbrains.com/docs/intellij/implementing-lexer.html

I cannot represent a stack of states with one integer. I looked up implementations of other language plugins, but I couldn't find a good solution.

The RegExpLexer of IntelliJ and the elm-plugin (https://github.com/durkiewicz/elm-plugin) seem to ignore the requirement and don't do anything to restore the full state when `Lexer.start` is called.

The Dart plugin deletes any remaining data of previous executions from the stack. I guess this at least ensures deterministic behavior, but the state is still not restored correctly. https://github.com/JetBrains/intellij-plugins/blob/master/Dart/src/com/jetbrains/lang/dart/lexer/DartLexer.java

I found only one solution which actually seems to restore the correct state. It is the IntelliJ-Adaptor provided by ANTLR 4. However, the solution of ANTLR is also more complicated. I also fear that the implementation of ANTLR might cause a memory leak. https://github.com/antlr/antlr4-intellij-adaptor/blob/master/src/main/java/org/antlr/intellij/adaptor/lexer/ANTLRLexerAdaptor.java

I would like to know how language plugins are supposed to work with this limitation. Most plugins just don't seem to restore the correct state. Is that a problem or an intentional compromise? Maybe a "good guess" for the correct highlighting is good enough? Does the syntax highlighter revalidate the whole file anyway a few seconds after the user stops typing?

3 comments
 
Hey. Sorry for the slow response. In general, you should try to implement your lexer so that it returns to YYINITIAL regularly. That is the simplest way to make highlighter updates incremental.
Alternatively, you can implement RestartableLexer (implement isRestartableState() and make the start method of your lexer work correctly from a non-initial state) to support several restartable states. You still have to restore the stack from a single integer somehow.
For now, you can try to find the most commonly used stacks for your problematic states and encode each of them with a different integer value. Let's say we are talking about the stack YYINITIAL(0), CODE_STATE(2), LITERAL_STATE_RESTARTABLE(4) for LITERAL_STATE_RESTARTABLE(4), and the stack YYINITIAL(0), CODE_STATE(2), SOME_DIFFERENT_STATE(8), LITERAL_STATE_NOT_RESTARTABLE(5) for LITERAL_STATE_NOT_RESTARTABLE(5). Then:
1) Make isRestartableState() return true for LITERAL_STATE_RESTARTABLE(4).
2) Handle starting your lexer from this specific state by restoring the stack properly (in this case YYINITIAL(0), CODE_STATE(2), LITERAL_STATE_RESTARTABLE(4) if you start from LITERAL_STATE_RESTARTABLE(4)).
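The two steps above can be sketched roughly as follows. This is a stand-alone illustration, not IntelliJ code: the class `KnownStacks` and its methods are hypothetical, the state numbers are taken from the example in this comment, and it assumes each restartable state is only ever reached through one well-known stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

/** Hypothetical helper; not part of JFlex or the IntelliJ Platform. */
final class KnownStacks {
    static final int YYINITIAL = 0;
    static final int CODE_STATE = 2;
    static final int LITERAL_STATE_RESTARTABLE = 4;

    // Assumption: every key is reached through exactly one stack, listed bottom to top.
    private static final Map<Integer, int[]> STACK_BY_STATE = Map.of(
        YYINITIAL, new int[] {YYINITIAL},
        LITERAL_STATE_RESTARTABLE,
        new int[] {YYINITIAL, CODE_STATE, LITERAL_STATE_RESTARTABLE});

    /** Plays the role of isRestartableState(): true only if the stack can be rebuilt. */
    static boolean isRestartableState(int state) {
        return STACK_BY_STATE.containsKey(state);
    }

    /** Rebuilds the stack for a restartable state; the top of the stack is the active state. */
    static Deque<Integer> restoreStack(int state) {
        int[] known = STACK_BY_STATE.get(state);
        if (known == null) {
            throw new IllegalArgumentException("not restartable: " + state);
        }
        Deque<Integer> stack = new ArrayDeque<>();
        for (int s : known) {
            stack.push(s); // push bottom first so the last element ends up on top
        }
        return stack;
    }
}
```

A state such as LITERAL_STATE_NOT_RESTARTABLE(5) is simply left out of the table, so the highlighter would not be told it can restart there.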

Here is my current implementation which I might end up using. It actually fixed some strange behavior I have noticed in syntax highlighting. Not sure if this is a good solution, though.

While debugging, I noticed that IDEA never called reset with `initialState != 0`. Maybe IDEA only resets the lexer to positions where Lexer.getState() has returned 0?

User code in `*.flex` file:

// Each entry packs (parent index << 32) | lexical state, so one index identifies a whole stack.
private final TLongArrayList states = new TLongArrayList();
// Reverse lookup from a packed entry to its index in `states`.
private final TLongIntHashMap stateIndexMap = new TLongIntHashMap();

{
    states.add(YYINITIAL);
    stateIndexMap.put(YYINITIAL, 0);
}

private int currentStateIndex = 0;
private int parentStateIndex = 0;

public int getStateIndex() {
    return currentStateIndex;
}

// Restores the lexer to a state index previously returned by getStateIndex().
public void restoreState(int stateIndex) {
    long state = states.get(stateIndex);
    currentStateIndex = stateIndex;
    parentStateIndex = (int) (state >> 32);
    yybegin((int) state);
}

private void pushState(int yystate) {
    long state = ((long) currentStateIndex << 32) | ((long) yystate & 0x0FFFFFFFFL);
    int stateIndex = stateIndexMap.get(state); // Returns 0 if not found
    if (stateIndex == 0 && state != YYINITIAL) {
        // First time this (parent, state) pair is seen: register a new index for it.
        stateIndex = states.size();
        states.add(state);
        stateIndexMap.put(state, stateIndex);
    }
    parentStateIndex = currentStateIndex;
    currentStateIndex = stateIndex;
    yybegin(yystate);
}

private void popState() {
    restoreState(parentStateIndex);
}
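To make the idea easier to try outside of a generated lexer, here is a minimal stand-alone sketch of the same interning scheme. It uses plain `java.util` collections instead of Trove, and the class `InternedStateStack` and its method names are hypothetical. Note that the table only grows during the lifetime of one instance, and an index is only meaningful for the instance that produced it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of the interning idea; not part of JFlex or the IntelliJ Platform. */
final class InternedStateStack {
    // Each entry packs (parent entry index << 32) | lexical state.
    private final List<Long> entries = new ArrayList<>();
    private final Map<Long, Integer> indexByEntry = new HashMap<>();
    private int currentIndex = 0;

    InternedStateStack() {
        entries.add(0L);          // index 0: root, lexical state 0 (YYINITIAL)
        indexByEntry.put(0L, 0);
    }

    /** The single integer that would be reported as the lexer state. */
    int getStateIndex() {
        return currentIndex;
    }

    /** The lexical state on top of the stack identified by the current index. */
    int lexicalState() {
        long entry = entries.get(currentIndex);
        return (int) entry;
    }

    void push(int lexicalState) {
        long entry = ((long) currentIndex << 32) | (lexicalState & 0xFFFFFFFFL);
        Integer index = indexByEntry.get(entry);
        if (index == null) {      // first time this (parent, state) pair is seen
            index = entries.size();
            entries.add(entry);
            indexByEntry.put(entry, index);
        }
        currentIndex = index;
    }

    void pop() {
        currentIndex = (int) (entries.get(currentIndex) >> 32);
    }

    /** Jumps back to a stack that was reported earlier by getStateIndex(). */
    void restore(int stateIndex) {
        currentIndex = stateIndex;
    }
}
```

Pushing the same sequence of states always yields the same index, which is what lets a single integer stand in for the whole stack.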

Adapter implementation:

import com.intellij.lexer.FlexAdapter;
import com.intellij.lexer.FlexLexer;
import com.intellij.psi.tree.IElementType;

import java.io.IOException;

public class MyLexer extends FlexAdapter {
    public MyLexer() {
        super(new FlexLexer() {
            private final _MyLexer lexer = new _MyLexer(null);

            // The adapter only ever sees state indices, never raw JFlex states.
            @Override
            public void yybegin(int state) {
                lexer.restoreState(state);
            }

            @Override
            public int yystate() {
                return lexer.getStateIndex();
            }

            @Override
            public int getTokenStart() {
                return lexer.getTokenStart();
            }

            @Override
            public int getTokenEnd() {
                return lexer.getTokenEnd();
            }

            @Override
            public IElementType advance() throws IOException {
                return lexer.advance();
            }

            @Override
            public void reset(CharSequence buf, int start, int end, int initialState) {
                lexer.reset(buf, start, end, initialState);
                // initialState is a state index, so map it back to the real stack.
                lexer.restoreState(initialState);
            }
        });
    }
}

I found this commit in LexerEditorHighlighter which mentions that you can use LexerTestCase.checkCorrectRestart to test if your lexer is working correctly. The comment also mentions LayeredLexer, but I am not sure why and when it should be used.

Note that IntelliJ actually does not distinguish between different non-initial states by default. IntelliJ will only restart at tokens which have the initial state. You could therefore avoid my complex implementation from above and just return "stack.size()" as state, assuming your stack is initially empty. However, if your lexer does not return to its initial state for a long time, this can cause a sluggish user experience as described in the linked comment.
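A minimal sketch of the "stack.size()" variant could look like this. The class `DepthAsState` is hypothetical and stand-alone, and it assumes, as described above, that a restart is only ever requested with state 0.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical state holder that reports the stack depth as the lexer state. */
final class DepthAsState {
    private final Deque<Integer> stack = new ArrayDeque<>();

    void push(int lexicalState) {
        stack.push(lexicalState);
    }

    void pop() {
        stack.pop();
    }

    /** 0 only when the stack is empty, i.e. when a restart needs no stack at all. */
    int getState() {
        return stack.size();
    }

    /** A depth alone cannot rebuild a stack, so only state 0 is accepted here. */
    void start(int initialState) {
        if (initialState != 0) {
            throw new IllegalArgumentException("cannot restore depth " + initialState);
        }
        stack.clear();
    }
}
```

The non-zero values carry no information that could be restored; they only prevent a nested position from being mistaken for the initial state.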

If you want to avoid such a sluggish user experience, it seems you must implement proper state handling as demonstrated in my last comment. Your lexer must then implement RestartableLexer (source), which tells IntelliJ that the lexer can be restarted at other states as well. Note that you should use LexerTestCase.checkCorrectRestartOnEveryToken instead of LexerTestCase.checkCorrectRestart for testing such a lexer. Unfortunately, this interface has experimental methods without default implementations, which means your plugin will break if JetBrains changes the experimental API of this interface.
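To illustrate what a "restart on every token" check is about, here is a toy model. It does not use or reproduce LexerTestCase; the class `RestartCheck` and the one-character-per-token lexer are made up for this example. The idea is to lex the whole text once, then restart at each token with the state recorded there and compare the result with the tail of the full run.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy restart check with hypothetical names; independent of the IntelliJ test framework. */
final class RestartCheck {
    /**
     * Toy lexer: every character is one token, '(' increases the depth and ')' decreases it.
     * Each token is encoded as "offset:stateBeforeToken:type".
     */
    static List<String> lex(String text, int start, int initialState) {
        List<String> tokens = new ArrayList<>();
        int depth = initialState;
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            int before = depth;
            String type;
            if (c == '(') {
                depth++;
                type = "OPEN";
            } else if (c == ')') {
                depth = Math.max(0, depth - 1);
                type = "CLOSE";
            } else {
                type = (before == 0) ? "TOP" : "NESTED";
            }
            tokens.add(i + ":" + before + ":" + type);
        }
        return tokens;
    }

    /** True if restarting at every token with its recorded state reproduces the remaining tokens. */
    static boolean restartsCorrectly(String text) {
        List<String> full = lex(text, 0, 0);
        for (int i = 0; i < full.size(); i++) {
            String[] parts = full.get(i).split(":");
            int offset = Integer.parseInt(parts[0]);
            int state = Integer.parseInt(parts[1]);
            List<String> tail = lex(text, offset, state);
            if (!tail.equals(full.subList(i, full.size()))) {
                return false;
            }
        }
        return true;
    }
}
```

In this toy the depth fully describes the stack, so every restart succeeds. A lexer whose integer state loses part of the stack would fail the comparison at the first token lexed under the lost context.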

