Syntax Highlighter Lexer with yycolumn

Hello all

After reading through the documentation on how to write syntax highlighting lexers, I have realized that the lexer must be able to resume at any position where it was in state 0 (YYINITIAL). The language I am writing a highlighter for is column-sensitive. For example, a comment can only start with an asterisk in the first column of a line. I have written the rule in my lexer like this:

"*" { if(yycolumn == 0) { yybegin(LINE_COMMENT); return LEGACY_COMMENT; } else { return TokenTypes.OPERATOR; } }

This works fine when the entire file is lexed at once. However, when I edit text after a comment and start deleting characters, it no longer works. I have added a little gif showing what I mean:

It seems that when the lexer is called for only the portion of code that was changed, yycolumn no longer reflects the actual column in the file. Is this true? And if so, is there a way to prevent that? Columns are an essential indicator for this use case.

Thank you in advance,
Yanick

4 comments

I tried to do the following:

  • Start in YYINITIAL
  • Check first character
  • * -> comment
  • ? -> compiler directive
  • EOL -> just treat as white space
  • white space -> switch to state LINE_CONTENT
  • everything else -> BAD_CHARACTER
  • EOL inside state LINE_CONTENT -> YYINITIAL

All the handling that used to be in YYINITIAL is now in LINE_CONTENT. Only the first character of every line is lexed in YYINITIAL, so the lexer can only restart at that point and must always lex the entire line.

Does this sound like a reasonable approach? It seems to work so far, and performance is still good even for large files.
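
The scheme in the list above can be modelled outside JFlex to see where a relex could start. This is only a toy sketch in plain Java, not the actual lexer: the state numbers, the DIRECTIVE state, the class name and the choice that any non-EOL first character leaves YYINITIAL are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the line-oriented state scheme; not generated by JFlex. */
class LineStartStateSketch {
    static final int YYINITIAL = 0;     // state 0: the only state a relex may start from
    static final int LINE_COMMENT = 1;  // hypothetical numbering
    static final int DIRECTIVE = 2;     // hypothetical state for '?' lines
    static final int LINE_CONTENT = 3;

    /** Returns the state the lexer is in just before it consumes each character. */
    static int[] statesBeforeEachChar(CharSequence text) {
        int[] states = new int[text.length()];
        int state = YYINITIAL;
        for (int i = 0; i < text.length(); i++) {
            states[i] = state;
            char c = text.charAt(i);
            if (c == '\n') {
                state = YYINITIAL;               // EOL always returns to YYINITIAL
            } else if (state == YYINITIAL) {
                // The first character of a line decides how the rest is lexed.
                if (c == '*') state = LINE_COMMENT;
                else if (c == '?') state = DIRECTIVE;
                else state = LINE_CONTENT;       // assumption: everything else leaves state 0
            }
        }
        return states;
    }

    /** Offsets at which the state is 0, i.e. where a relex could be started. */
    static List<Integer> restartOffsets(CharSequence text) {
        int[] states = statesBeforeEachChar(text);
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < states.length; i++) {
            if (states[i] == YYINITIAL) offsets.add(i);
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(restartOffsets("* c\n  a*b\n?d"));
    }
}
```

For the sample text in main, the only offsets in state 0 are 0, 4 and 10, which are exactly the line starts. The * in the middle of the second line is reached in LINE_CONTENT, so it can never be the first thing a restarted lexer sees.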

Permanently deleted user

Something is off with your code, and I suspect it will be hard to give you advice without seeing the complete code. I have several states in my JFlex lexer (strings, nested comments, specific rules) and it works for me. Can you share at least your JFlex code?

Yes, all the other rules are no problem. The only peculiar one is the rule where the column within a line determines the next state. I assumed that when IntelliJ decides a part needs to be relexed, it finds the closest state-0 (YYINITIAL) positions enclosing the changed area, resets the lexer to the start of that area, and also calculates the column at which it starts. That seems not to be the case: no matter at what index within the line lexing restarts, yycolumn is set to 0. This means I can no longer tell whether a * is the very first character of a line.
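
To make the symptom concrete, here is a small sketch in plain Java. It only models the behaviour reported in this thread (a column counter that starts at 0 on every restart) and is not IntelliJ's or JFlex's actual code; the class and method names are made up for illustration.

```java
/** Toy model of a column counter that is reset whenever scanning is restarted. */
class ColumnResetSketch {
    /**
     * Scans text from offset start and classifies the first '*' it meets,
     * using the same "column == 0" test as the original rule.
     */
    static String firstStarKind(String text, int start) {
        int column = 0; // assumption: the counter starts at 0 on every (re)start
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '*') {
                return column == 0 ? "COMMENT" : "OPERATOR";
            }
            column = (c == '\n') ? 0 : column + 1;
        }
        return "NONE";
    }

    public static void main(String[] args) {
        String line = "a = b * c\n";
        System.out.println(firstStarKind(line, 0)); // full lex from the line start
        System.out.println(firstStarKind(line, 6)); // restart directly at the '*'
    }
}
```

Lexing the whole line classifies the * as an operator, while restarting at offset 6 classifies the very same character as a comment start, because the counter is 0 there even though the real column is 6.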

My solution above makes sure that only the first character of every line is in the YYINITIAL state and everything else is in a state != 0. This prevents IntelliJ from restarting the lexer in the middle of a line; it can only restart at the beginning of one. That way a wrong yycolumn is no longer a real problem: the value it is reset to, 0, happens to be the correct column at the start of a line.
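
The reason this works can be checked with a few lines of plain Java. The helpers below are hypothetical and exist only for this illustration; they compute the real column of an offset by walking back to the previous line break.

```java
/** Helpers for comparing a reset column (always 0) with the real column of an offset. */
class LineStartColumnSketch {
    /** Real 0-based column of offset: the number of characters since the last '\n'. */
    static int trueColumn(CharSequence text, int offset) {
        int column = 0;
        for (int i = offset - 1; i >= 0 && text.charAt(i) != '\n'; i--) {
            column++;
        }
        return column;
    }

    /** True if offset is the first position of a line. */
    static boolean isLineStart(CharSequence text, int offset) {
        return offset == 0 || text.charAt(offset - 1) == '\n';
    }

    public static void main(String[] args) {
        String text = "ab\n*c";
        System.out.println(trueColumn(text, 3) + " " + isLineStart(text, 3));
        System.out.println(trueColumn(text, 4) + " " + isLineStart(text, 4));
    }
}
```

At every line start the real column is 0, so a counter that is reset to 0 on restart agrees with it. At any other offset the two differ, which is why restarts must be limited to line starts.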

Yes, returning to state 0 only at the beginning of a line is the best approach known to me for handling this case.
