Can't understand lexer workflow

I'm playing with IDEA's custom language support and can't get the trick.
Suppose I'm writing a simple grammar for syntax highlighting. It consists of exactly 3 rules:
"keyword" {return MyTokenTypes.KEYWORD;}
+ {return FMTokenTypes.BAD_CHARACTER;}
. { return FMTokenTypes.BAD_CHARACTER; }

I use no custom states, and the class hierarchy mirrors that of the JavaScript plugin. So I expect the word "keyword" to be highlighted and anything else to be marked as bad characters.
But as I haven't got the trick yet, I get results that seem strange to me. If I type "k", "e", "y", "w", "o", "r", "d" character by character, the word is not highlighted. But if I erase and retype one of the first three characters of the word, it gets lexed and highlighted as I want. And of course any modification before the unhighlighted word "keyword" makes it highlighted.
After debugging FlexAdapter further, I found that the character sequence the lexer receives for analysis is a bit weird. The first 4 characters behave well: each one increments endOffset, and the lexer gets "k", "ke", "key", "keyw". But then this 4-character window slides as I type, so the lexer gets "eywo", "ywor", "word", and no match occurs. If I remove "y" (leaving "keword"), the character sequence for lexing is "keword" (still no match), and on typing "y" again the lexer gets "keyword" and makes a match. But if I repeat this with the fourth letter, "w", nothing good happens: I get "eyord" for lexing with no match, then type "w" again and get "eyword", again with no match.
I debugged the JavaScript lexer, and it gets a good sequence for lexing every time, i.e. the length of the analyzed token increases as I type "function", for example.

So the question is: why does the lexer give me 4 characters instead of the full typed string, and what is the solution? I've read all the related threads, but it seems I've missed something important.

3 comments

The editor relexes incrementally, restarting from the initial state.
The JavaScript lexer matches a partially typed keyword (e.g. 'func') as an
identifier, so that token gets updated as you type.

(In reply to Max Ishchenko's question above.)

--
Best regards,
Maxim Mossienko
IntelliJ Labs / JetBrains Inc.
http://www.intellij.com
"Develop with pleasure!"

Hi, I know this thread is ancient but I am having a similar problem with a custom plugin and wonder how you solved this. Thanks very much, any help is appreciated.

I would post code and perhaps more detail. There really wasn't anything to solve here. It does help to have something that separates your tokens. The example above had no token separators other than bad characters.
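
For anyone landing here with the same symptom, here is a minimal JFlex sketch of the advice in this thread: give the lexer an identifier rule and a whitespace rule, so that a partially typed word such as "keyw" is one token rather than a run of single bad characters. It is an illustration, not the original poster's actual fix. The package name, the class name MyLexer and the token type MyTokenTypes.IDENTIFIER are hypothetical, and the options assume the usual IntelliJ JFlex setup.

```
package com.example.mylang; // hypothetical package

import com.intellij.lexer.FlexLexer;
import com.intellij.psi.TokenType;
import com.intellij.psi.tree.IElementType;

%%

%class MyLexer
%implements FlexLexer
%unicode
%function advance
%type IElementType

// Assumed definitions of a "word" and of token separators.
IDENTIFIER  = [a-zA-Z_][a-zA-Z0-9_]*
WHITE_SPACE = [ \t\r\n]+

%%

// JFlex takes the longest match; on a tie the earlier rule wins,
// so exactly "keyword" is a KEYWORD and "keywords" is an IDENTIFIER.
"keyword"      { return MyTokenTypes.KEYWORD; }

// A partially typed word ("k", "ke", ..., "keywor") is a single token.
// MyTokenTypes.IDENTIFIER is a hypothetical token type.
{IDENTIFIER}   { return MyTokenTypes.IDENTIFIER; }

{WHITE_SPACE}  { return TokenType.WHITE_SPACE; }

// Fallback for any character nothing else matched.
[^]            { return TokenType.BAD_CHARACTER; }
```

With rules like these, the word being typed stays a single token, so an incremental relex that restarts at a token boundary should see the whole word, which matches what was observed with the JavaScript lexer and "function".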
