Problem with HighlighterIterator passed to QuoteHandler

Created February 06, 2017 17:10

I'm having an issue with quote handling in my custom language plugin. I realize that there are MANY potential contributing factors, but I'll explain the behavior and share my findings so far in hopes that it's enough to drive some guidance on how it might be resolved.

Basically if you have a line like:

String str = foo;

which you change to:

String str = 'foo;

and then:

String str = 'foo';

by adding quotes at the beginning and end of the desired string respectively, all is well. However, if you do the same thing with:

String str = foo.bar;

then when you try to add the closing quote, it doubles the quote and doesn't highlight the actual string literal properly:

String str = 'foo.bar'';

Looking at PsiViewer, 'foo.bar' is properly recognized by my lexer/parser as a string literal, and if you cut/paste the string literal it highlights properly, so it seems to be an issue with the highlighter.

To add some substance to that assertion, stepping through the code in my SimpleTokenSetQuoteHandler implementation I'm seeing that in the former/working case isOpeningQuote() is passed a HighlighterIterator with a text range that includes the full quote ('foo') and returns false, but in the latter/broken case it's passed a HighlighterIterator with a text range that includes only the just-typed closing quote and returns true, hence the doubled quotes. It thinks it's auto-closing a new string literal.

So obviously the issue is that the HighlighterIterator is being computed improperly in the latter case, but I'm not sure what options are available for me to control how that determines the range. My hope/guess is that it's just a matter of implementing some EP properly, but I'm not sure which one to implement.

Any thoughts?

13 comments

Dmitry Batrak

Created February 07, 2017 08:22

HighlighterIterator's behaviour shouldn't depend on which offset it's being constructed for - it always returns same tokens (usually provided by lexer). Initial iterator's offset just determines its current token. If you're using some standard EditorHighlighter for your language, it should be the case already, if you're using a custom one, please check its implementation.

Also, SimpleTokenSetQuoteHandler is written assuming the whole string literal is treated as a single token by lexer/HighlighterIterator. This is probably not the case for your language/lexer. In that case you'll need some more complex logic in your QuoteHandler. See e.g. how JavaQuoteHandler.isOpeningQuote handles the case of escape characters present in string literal - they are also represented as separate tokens by Java lexer.

Scott Wells

Created February 07, 2017 14:08

Thanks, Dmitry. Actually string literals are a single token in my language. I'm just using the standard string literal regular expression:

'([^'\\]|\\.)*'

and aside from the issue I'm hitting here, it's worked very well.

I did look at overriding isOpeningQuote() to solve this, and I did determine a pattern for the errant case and was able to adjust for it. While this did prevent the extraneous double-quoting when typing the closing quote, it didn't resolve the highlighter issue.

I haven't registered a custom EditorHighlighter implementation for this language, but I'll take a look at what it's doing here and see if perhaps I can determine/resolve the issue there. Let me see what I can find and I might end up having additional questions for you if you don't mind.

Dmitry Batrak

Created February 07, 2017 14:19

What about this code in your language:

String str = 'foo.bar';

Is it highlighted properly? Is the whole string literal parsed as a single token by your lexer? If yes, then it looks like you shouldn't have the problem with SimpleTokenSetQuoteHandler you've described.

Scott Wells

Created February 07, 2017 14:33

That's right, Dmitry. In the code you provided above, the portion:

'foo.bar'

is a single string literal token. It seems that when there's a dot (.) inside an unquoted set of tokens which are recognized by my lexer as IDENT (DOT IDENT)*, then I add an opening quote (') before that set of tokens, and finally I try to add a closing quote (') after that set of tokens, the resulting iterator ends up with a different text range (iterator.getStart()/getEnd()) than if there were not dot. As a result, adding a closing to quote to:

String str1 = 'foo;

works just fine and results in (actual screenshot):

but adding a closing quote to:

String str1 = 'foo.bar;

causes the errant behavior and results in:

If, however, you look at PsiViewer for the latter, the lexer and parser both seem to be doing the right thing:

It just seems that the highlighter iterator is getting into a bad state which results in both improper highlighting and improper behavior in the quote handler.

Hopefully that level of detail helps. Please let me know if you have any other thoughts as this is definitely odd behavior.

Dmitry Batrak

Created February 08, 2017 12:03

First of all the decision to enter a second quote (leading to the broken code) in the IntelliJ platform code is taken when the document contents is like this:

String str = 'foo.bar';

i.e. after the first quote was already inserted (see last part of com.intellij.codeInsight.editorActions.TypedHandler#handleQuote method). That's why I draw your attention to this case.

Unless you provide your own implementation of HighlighterIterator (I assume you don't, based on your previous comments), its getStart()/getEnd() methods just reflect the lexer token boundaries at that document state (you can check implementation of com.intellij.openapi.editor.ex.util.LexerEditorHighlighter.HighlighterIteratorImpl - it has no state by itself apart from segment's index). Most probably the problem you describe means there's some issue with your lexer, so that on document update LexerEditorHighlighter's state is not updated correctly. Try to check what state it's in at that point in the code using debugging.

Scott Wells

Created February 08, 2017 14:48

Thanks again, Dmitry. Just to clarify, I definitely understand the nature of the QuoteHandler. I was just commenting that by the time my QuoteHandler implementation is invoked, in the case that works ('foo'), the range is the entire string literal, and in the case that doesn't work ('foo.bar'), the range is just the closing quote. As a result, whatever is happening is definitely upstream of the quote handler, and the quote handler's errant behavior is just a side-effect of the root cause.

I'll continue trying to debug the highlighter and my lexer. The odd thing is that if I type the statement that produces the errant behavior as I indicated above, I see the errant behavior, but if I then cut that line and paste it back, all is good. Whatever is happening seems to be with a temporary/transitional state while entering the text rather than the actual lexer/parser's stable state which seems correct. The parser doesn't seem to be involved here, but it almost reminds of when I was tuning the recovery rules in the parser...

Scott Wells

Created February 13, 2017 03:45

Dmitry, I debugged this quite a bit over the weekend and never could get to the bottom of why the highlighter iterator was treating "'foo" differently than "'foo.bar". In the end I overrode isOpeningQuote() in my QuoteHandler to recognize and correct the situation in a quite generalized fashion.

The only remaining issues were that the resulting string literal wasn't highlighted properly and the opening quote still showed a BAD_CHARACTER error. To correct those issues I just forced a reparse of the current file (FileContentUtil.reparseFiles()) when this specific situation is recognized.

The end result is seamless from the user's standpoint, but I wondered if perhaps there was a lighter weight way to reparse/reprocess just the string literal (or some other subset of the tree) and not the entire file. Any thoughts, or is what I'm doing now about as good as it will get?

Thanks for the help!

Dmitry Batrak

Created February 13, 2017 06:55

Have you tried to debug LexerEditorHighlighter.documentChanged() ? It's the place where incremental re-lexing takes place. (I assume when lexing the whole file, e.g. on initial editor opening, everything works fine for you).

Scott Wells

Created February 13, 2017 13:40

Thanks, Dmitry. You're correct that if you close and reopen the file, all is good. Similarly if you cut/paste the string literal, everything is fine.

Thanks for the pointer. LexerEditorHighlighter.documentChanged() is exactly what I was debugging yesterday to try to understand this. In the end I was having a hard time understanding why the segments were different and what effect that was having in terms of root cause analysis.

Once I landed on my workaround (vs. root cause fix), I noticed that documentChanged() was calling HighlighterClient.repaint(startOffset, endOffset) so I figured out how to get from the document to that call and tried to invoke it with the start and end offsets of the resulting string. Unfortunately that didn't seem to put things into a good state again. I also tried Document.replaceString(startOffset, endOffset, sameText) to see if that might cause it to reprocess the string as if it has been cut and immediately pasted by the user, but that also didn't resolve it. Of the options I tried, only reparsing the entire file did the trick.

If I end up sticking with my current workaround, it won't be the end of the world. I typically prefer not to have workarounds unless absolutely necessary. They're obviously fragile and often the core issue manifests in other ways which can lead to a series of workarounds. I'm hoping that won't be the case here, and if it does start looking that way I'll dig back into the original problem. However, the end behavior for the user right now is exactly what would be expected. Just feels kind of dirty to me...

Scott Wells

Created February 13, 2017 19:22

Another update...evidently the approach of reparsing doesn't work for LightVirtualFiles either. I use those to host EditorTextFields for various purposes. I tried again over lunch to see if I could get something other than a full reparse to cause the editor to reevaluate the string, but nothing else seems to do it. This may just be what it is at this point, but if anyone did have any other thoughts, they'd certainly be much appreciated!

Scott Wells

Created February 14, 2017 15:11

Hi, Dmitry. I did debug the root cause issue a bit more and found something quite interesting that I'm hoping might help shed some light on what's happening and what I might to do address it properly. I set a breakpoint in LexerEditorHighlighter.documentChanged() and dumped "mySegments" for my language for both "foo" and "foo.bar" and then did the same thing for Java.

Here are the findings in each case typing a closing quote before the statement terminator (;):

My language for String str = 'foo;

Segment 28 [144,147] = 'String'
Segment 29 [147,148] = ' '
Segment 30 [148,151] = 'str'
Segment 31 [151,152] = ' '
Segment 32 [152,153] = '='
Segment 33 [153,154] = ' '
Segment 34 [154,155] = '''
Segment 35 [155,158] = 'foo'
Segment 36 [158,159] = '''
Segment 37 [159,160] = ';'

My language for String str = 'foo.bar;

Java for String str = "foo;

Segment 8 [29,35] = 'String'
Segment 9 [35,36] = ' '
Segment 10 [36,39] = 'str'
Segment 11 [39,40] = ' '
Segment 12 [40,41] = '='
Segment 13 [41,42] = ' '
Segment 14 [42,47] = '"foo"'
Segment 15 [47,48] = ';'

Java for String str = "foo.bar;

Segment 8 [29,35] = 'String'
Segment 9 [35,36] = ' '
Segment 10 [36,39] = 'str'
Segment 11 [39,40] = ' '
Segment 12 [40,41] = '='
Segment 13 [41,42] = ' '
Segment 14 [42,51] = '"foo.bar"'
Segment 15 [51,52] = ';'

The obvious difference is that in Java, by the time that documentChanged() is invoked it appears the full string has been properly tokenized (segment 14 in each Java example), but in my language there are still distinct segments for the open quote, string body content, and close quote (segments 34-36/38 respectively).

Any idea what might be causing that difference in behavior? Thanks in advance for any insights you can provide!

UPDATE: It appears that Java's string literal token is a bit smarter than mine. Even an unclosed string literal, e.g., "foo.bar, is treated as a single segment. How is it doing that? How does it know that the semicolon statement terminator should not be considered part of the string and therefore the same segment?

Scott Wells

Created February 14, 2017 16:17

Okay, I think I've managed to resolve this by changing my string literal expression from:

'([^'\\]|\\.)*'

to:

'([^'\\\n]|\\.)*['\n]

which is actually more correct since in this language, strings cannot be multi-line without using a concatenation operator. I need to test the change more, but my hope is that this puts this entire issue to rest. I'll follow up either way to help anyone else who might have the same issue in the future.

UPDATE: Two things on this:

I'm trying to work out one last issue with the expression above where it doesn't work properly if the unterminated string literal ends with end-of-file instead of end-of-line. I have a question out to the JFlex users list to see if there's a good way to handle that.
Because this expression detects grammatically incorrect string literals, I'll also need to add an inspection that marks unterminated string literals as errors since the lexer/parser will not.

Scott Wells

Created February 17, 2017 16:51

Okay, here's the final (and vastly simpler) solution. I changed my lexer token for a string literal to:

BAD_STRING_LITERAL='([^'\\\r\n]|\\.)*
STRING_LITERAL={BAD_STRING_LITERAL}'
...

%%

<YYINITIAL> {
  ...

  {STRING_LITERAL}     { return STRING_LITERAL; }
  {BAD_STRING_LITERAL} { return com.intellij.psi.TokenType.BAD_CHARACTER; }

  ...
}

And that's pretty much it. Now valid string literals are always recognized immediately and unterminated string literals are reported to the world as BAD_CHARACTER which I'd already configured my QuoteHandler to consider in its handling.

I've tested this quite extensively and it seems rock solid, but of course if it does end up having issues I'll post here to let any interested parties know. Hopefully this is my final iteration on this. Dmitry, thanks again for all the great discussion on this!

Please sign in to leave a comment.