Custom language: Weird string lexing problem

I have a custom language in which the following JFlex macro defines the
string literal token:

STRING_LITERAL = \" ( [^\\\"\r\n] | "\\" )* \"

This appears to work when viewing files which already exist. But when I
am typing new code, I get the following problem:

1. Type opening quote.
2. Type some characters.
3. Type a backslash.
4. Type another character.
5. Type closing quote.

Now, according to the PSI structure (all hail PsiViewer), my language
parser has correctly recognized it as a string literal element. However,
it has not been highlighted as such. The opening and closing quotes as
well as the backslash are marked as bad characters (red/pink background).

However, if I go back and type another character somewhere before the
backslash, the string suddenly highlights as correct. I can also insert
backslashes anywhere inside the string except just before the closing
quote, at which point the highlight reverts to the bad state like I
originally described.

Any idea what's going on here?

Thanks,
Gordon

--
Gordon Tyler (Software Developer)
Quest Software <http://www.quest.com/>
260 King Street East, Toronto, Ontario M5A 4L5, Canada
Voice: (416) 933-5046 | Fax: (416) 933-5001


This sounds like an IDEA cache consistency bug. Maybe you should post a sample
project for Dmitry, Eugene, or Maxim to look at.


This is most probably due to incremental relexing after typing into the editor.

The contract for incremental relexing is roughly as follows.
Let n be the offset of the change made to the document, and let k be the
maximum start offset, with k < n, among all lexemes of the previous lexing.
In plain words: fall back to the start offset of the damaged lexeme. From
there, fall back further to the first undamaged lexeme that also starts in the
initial lexer state (0, or YYINITIAL for JFlex-generated lexers). That is the
position at which incremental relexing starts. Relexing continues until the
damaged area is covered and the new tokens are back in sync with the old
tokens at some position where the lexer is in its initial state.

So a lexer implementation is expected to give consistent results when asked
to relex from any position it has reported as initial.
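
To make the fallback rule above concrete, here is a minimal sketch in Java. It is only one reading of the contract as described in this post, not IDEA's actual implementation: the `Tok` class, the `relexStart` and `toks` helpers, and the `IN_STRING` state are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class RelexStart {
    static final int INITIAL = 0;   // YYINITIAL in a JFlex-generated lexer
    static final int IN_STRING = 1; // hypothetical extra lexer state

    /** Hypothetical record of one old token: its start offset and the lexer state at that offset. */
    static final class Tok {
        final int start;
        final int state;
        Tok(int start, int state) { this.start = start; this.state = state; }
    }

    /** Returns the offset at which relexing would start for a change at changeOffset. */
    static int relexStart(List<Tok> oldTokens, int changeOffset) {
        // Step 1: the damaged token is the last one that starts before the change.
        int damaged = -1;
        for (int i = 0; i < oldTokens.size(); i++) {
            if (oldTokens.get(i).start < changeOffset) damaged = i;
        }
        // Step 2: walk back to the nearest earlier token that starts in the initial state.
        for (int i = damaged - 1; i >= 0; i--) {
            if (oldTokens.get(i).state == INITIAL) return oldTokens.get(i).start;
        }
        return 0; // nothing suitable found: relex from the start of the document
    }

    /** Builds a token list from (start, state) pairs. */
    static List<Tok> toks(int... pairs) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i + 1 < pairs.length; i += 2) out.add(new Tok(pairs[i], pairs[i + 1]));
        return out;
    }

    public static void main(String[] args) {
        // Old tokens for the text  x = "ab\c  (token starts at 0,1,2,3,4,5,7,8).
        // Single-state lexer: every token starts in INITIAL.
        List<Tok> singleState = toks(0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 7, 0, 8, 0);
        System.out.println(relexStart(singleState, 9)); // prints 7

        // Multi-state lexer: tokens after the opening quote start in IN_STRING.
        List<Tok> multiState = toks(0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 1, 7, 1, 8, 1);
        System.out.println(relexStart(multiState, 9)); // prints 4
    }
}
```

Under these assumptions, typing a closing quote at offset 9 makes the single-state lexer restart at the backslash (offset 7), in the middle of the would-be string, while the multi-state variant restarts at the opening quote (offset 4).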

Feel free to ask for further clarifications. I understand this is not trivial
stuff...

-------------------
Maxim Shafirov
http://www.jetbrains.com
"Develop with pleasure!"


So Max, are you saying JFlex has a bug? I believe Gordon is using stock JFlex
(but I don't know).


Maxim,

I think I understand the algorithm used for re-lexing, but how can JFlex generate results
that are not consistent with this contract?

I suspect that the described problem is similar to the following phenomenon I
observed: In a JavaScript file, type "var x = (". The opening paren is now highlighted as
an invalid character, and the highlighting vanishes once the closing paren is typed. This
does not happen when modifying a complete statement, though: type "var x = y;", move the
caret one column to the left, before the semicolon, and type "(". The paren is not
considered an illegal character and the automatic insertion of the closing paren works. I
have no idea how to interpret that, but it doesn't seem right to me...

Sascha


> So Max, are you saying JFlex has a bug? I believe Gordon is using
> stock JFlex (but I don't know).


Nope. I mean the problem most probably has to do with lexer states, which
are handled solely by the programmer.

Maxim Shafirov

I do not think these two problems are related, but I will certainly look into
the JavaScript one. Thanks.

Maxim Shafirov

Maxim Shafirov (JetBrains) wrote:
>> So Max, are you saying JFlex has a bug? I believe Gordon is using
>> stock JFlex (but I don't know).
>
> Nope. I mean the problem is most probably dealing with lexer states,
> which are solely handled by the programmer.


Yes, I think the problem is caused by my lexer grammar being defined to
match the whole string as a single entity. Thus when a character is
entered which invalidates that, the text is no longer recognized as a
string token but rather as: bad char '"', identifier token, bad char '\'.
And then when I type the closing '"', it doesn't relex from the start of
the string.

So, to fix this, I need to use multi-state lexing for strings. Make the
grammar enter a new state when it finds a '"', which then has separate
rules for processing characters until it finds a closing '"', and
then resets back to YYINITIAL.

Right?
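
A minimal sketch of the multi-state idea described above, written as a hand-rolled Java state machine rather than a JFlex spec so that it can be run directly. The token names (QUOTE, STRING_CHARS, ESCAPE, NEWLINE, IDENT, WS, OTHER) and the IN_STRING state are assumptions made for illustration; a real JFlex grammar would declare the extra state with %state and switch states with yybegin().

```java
import java.util.ArrayList;
import java.util.List;

public class StringStateLexer {
    enum State { INITIAL, IN_STRING }

    /** Lexes text into "TYPE:text" entries; no input is ever reported as a bad character. */
    static List<String> lex(String text) {
        List<String> tokens = new ArrayList<>();
        State state = State.INITIAL;
        int len = text.length();
        int i = 0;
        while (i < len) {
            int start = i;
            char c = text.charAt(i);
            String type;
            if (state == State.INITIAL) {
                if (c == '"') {                       // opening quote: enter the string state
                    type = "QUOTE"; i++; state = State.IN_STRING;
                } else if (Character.isLetterOrDigit(c)) {
                    while (i < len && Character.isLetterOrDigit(text.charAt(i))) i++;
                    type = "IDENT";
                } else if (Character.isWhitespace(c)) {
                    while (i < len && Character.isWhitespace(text.charAt(i))) i++;
                    type = "WS";
                } else {
                    type = "OTHER"; i++;
                }
            } else { // IN_STRING
                if (c == '"') {                       // closing quote: back to the initial state
                    type = "QUOTE"; i++; state = State.INITIAL;
                } else if (c == '\\') {               // backslash plus the escaped character, if any
                    boolean hasNext = i + 1 < len && text.charAt(i + 1) != '\r' && text.charAt(i + 1) != '\n';
                    i += hasNext ? 2 : 1;
                    type = "ESCAPE";
                } else if (c == '\r' || c == '\n') {  // unterminated string ends at the line break
                    type = "NEWLINE"; i++; state = State.INITIAL;
                } else {
                    while (i < len && "\"\\\r\n".indexOf(text.charAt(i)) < 0) i++;
                    type = "STRING_CHARS";
                }
            }
            tokens.add(type + ":" + text.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("\"ab\\"));     // half-typed string
        System.out.println(lex("\"ab\\c\""));  // prints [QUOTE:", STRING_CHARS:ab, ESCAPE:\c, QUOTE:"]
    }
}
```

The point of the design is that a half-typed string already lexes into the same leading tokens as the finished one, so relexing from the opening quote gives consistent results at every keystroke.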

Ciao,
Gordon


Gordon Tyler wrote:

> So, to fix this, I need to use multi-state lexing for strings. Make the
> grammar start a new state when it finds a '"', which then has separate
> rules for processing of characters until it finds a closing '"', and
> then resets back to YYINITIAL.


That's the right way, I think.
I stumbled on a similar problem when lexing comments as single tokens.
Ever wondered about the comment definition in the javascript-plugin?
C_STYLE_COMMENT=("/*"[^"*"]{COMMENT_TAIL})|"/*"
COMMENT_TAIL=([^"*"]*("*"+[^"*""/"])?)*("*"+"/")?
It simply allows the closing */ to be missing, so in this case
the annotator has to check for this kind of error.
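
The same idea, a comment token whose closing */ is optional plus a separate check for the missing delimiter, can be sketched with a plain Java regex. This is an illustration only and not the plugin's code: the class name, the pattern, and the commentToken and isTerminated helpers are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TolerantComment {
    // "/*", then a body that never consumes a closing "*/", then an optional "*/".
    static final Pattern COMMENT =
            Pattern.compile("/\\*(?:[^*]|\\*+[^*/])*(?:\\*+/)?");

    /** Returns the comment token at the start of text, or null if text does not start with "/*". */
    static String commentToken(String text) {
        Matcher m = COMMENT.matcher(text);
        return m.lookingAt() ? m.group() : null;
    }

    /** The check left to the annotator: is the closing delimiter actually there? */
    static boolean isTerminated(String token) {
        return token.length() >= 4 && token.endsWith("*/");
    }

    public static void main(String[] args) {
        String closed = commentToken("/* done */ var x;");
        String open = commentToken("/* still typing");
        System.out.println(closed + " -> " + isTerminated(closed));
        System.out.println(open + " -> " + isTerminated(open));
    }
}
```

Because the unterminated form is still a single comment token, the lexer never has to emit bad characters while the comment is being typed; reporting the missing */ becomes a separate step.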


Maxim Shafirov (JetBrains) wrote:

> I do not think these two problems are similar but surely will look into
> this JavaScript one. Thanks.


I guess you're right. Note that this isn't a JavaScript-only problem but can be observed
in other languages as well. It also happens when typing a quote character, i.e. the
closing one isn't automatically inserted unless some other text comes after it.

Are you looking into it already or should I file a request?

Sascha
