Editing document yields exception; wrong index into buffer

Permanently deleted user

Created September 25, 2013 02:53

a = "xy";

When I delete the closing quote, I get:

java.lang.IndexOutOfBoundsException: Wrong offset: 7. Should be in range: [0, 4] Lexer: com.simpleplugin.ANTLRv4LexerAdaptor@673f27a1
at com.intellij.openapi.editor.ex.util.LexerEditorHighlighter.documentChanged(LexerEditorHighlighter.java:162)
at com.intellij.openapi.editor.impl.DocumentImpl.changedUpdate(DocumentImpl.java:592)

which is right here in documentChanged():

    try {
      segmentIndex = mySegments.findSegmentIndex(oldStartOffset) - 2;
    }
    catch (IndexOutOfBoundsException ex) {
      // <-- the exception is rethrown here
      throw new IndexOutOfBoundsException(ex.getMessage() + " Lexer: " + myLexer);
    }

The error occurs because the segments only cover offsets 0..4:

mySegments = com.intellij.openapi.editor.ex.util.SegmentArrayWithData@33a515d3

myData = {short[64]@12855}
myStarts = {int[64]@12856}
myEnds = {int[64]@12857}

mySegmentCount = 4 and oldStartOffset = 7. Anybody have any ideas what the segments are? I can imagine they are tokens but why would we compare the index into the character buffer with a token index? Can anybody explain why the document update event has the right character index but the mySegments variable has the wrong arrays or wrong size anyway?

It doesn't kill the editor, since I can add the " back in (though I get the same exception when I do). To reproduce: delete the closing " and then add it back.

Thanks!
Ter

13 comments

Jon Akhtar

Created September 25, 2013 06:26

So dump out your tokens in both cases, along with their char ranges. Perhaps you can use the PsiViewer.

When the document changes there is a time where the text does not match the PSI tree until the document is committed.

Is your lexer restartable?

I think your tokens are at first:

KEY SEPARATOR VALUE SEPARATOR

Is that right?

It wants to relex the portion of the document that changed. Will your lexer support that?

Permanently deleted user

Created September 25, 2013 20:43

Hi! ANTLR starts out with tokens from builder.advanceLexer(): a, =, "x", ;, EOF. I have attached what IDEA thinks the tree and tokens are. Note the offset range is (4,7) for the string, which is correct. At first I thought killing the last quote was the issue, because then my tokenizer tries to read all the way to the end of file and that probably messes things up. Actually, that works fine. It gives me a red underline after the semicolon and an error node within the PSI that says:

node PsiErrorElement:mismatched input '<EOF>' expecting {INT, STRING, ID}

Putting the quote back in is what causes the trouble I believe. I get the exception:

Wrong offset: 6. Should be in range: [0, 4].

According to the code, it looks like a segment or token index and it's expecting that character index 6 is between the start and stop of segment[i] for some i in 0..4.
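To make that reading concrete, here is a minimal sketch of a range lookup that fails the same way. It assumes the segments are token character ranges and that the four recorded segments are a, whitespace, =, whitespace, ending at offset 4. Both are guesses, and SegmentLookupSketch is a hypothetical stand-in, not the actual SegmentArrayWithData class.

```java
public class SegmentLookupSketch {
    private final int[] starts;
    private final int[] ends;

    public SegmentLookupSketch(int[] starts, int[] ends) {
        this.starts = starts;
        this.ends = ends;
    }

    // Same shape as the getLastValidOffset() quoted in this thread.
    public int getLastValidOffset() {
        return ends.length == 0 ? 0 : ends[ends.length - 1];
    }

    // Hypothetical lookup: maps a CHARACTER offset to the index of the
    // segment whose [start, end) range contains it.
    public int findSegmentIndex(int offset) {
        int last = getLastValidOffset();
        if (offset < 0 || offset > last) {
            throw new IndexOutOfBoundsException(
                "Wrong offset: " + offset + ". Should be in range: [0, " + last + "]");
        }
        for (int i = 0; i < starts.length; i++) {
            if (offset >= starts[i] && offset < ends[i]) {
                return i;
            }
        }
        // offset == last valid offset: treat it as belonging to the last segment
        return starts.length - 1;
    }

    public static void main(String[] args) {
        // Assumed segments: "a", " ", "=", " " and nothing for the string or the semicolon.
        SegmentLookupSketch segments = new SegmentLookupSketch(
            new int[] {0, 1, 2, 3}, new int[] {1, 2, 3, 4});
        System.out.println(segments.findSegmentIndex(2)); // prints 2
        try {
            segments.findSegmentIndex(6);
        } catch (IndexOutOfBoundsException ex) {
            System.out.println(ex.getMessage()); // Wrong offset: 6. Should be in range: [0, 4]
        }
    }
}
```

Under these assumptions the document event carries a perfectly good character offset; the lookup fails only because the recorded ranges stop short of the end of the text.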

oldStartOffset is 6, which was the start of the semicolon before I killed the ".  Issue seems to be here:

  public int getLastValidOffset() {
    return mySegmentCount == 0 ? 0 : myEnds[mySegmentCount - 1];
  }

That function returns 4. mySegmentCount is 4 and myEnds[4-1] = 4, which explains the exception but not why the segments stop there. It seems that the last valid offset is the start of the string, so the segments end where the string token should begin. Maybe that's the issue. When I get a bad token sequence, I might not be informing IDEA about the range properly. I see this as the start of my relexing process:

text=a = "x";, start=0, stop=8
line 1:2 token recognition error at: '"x;' <--- antlr error to stderr.

Maybe all I need is a good example from somebody else's hand-built parser showing how to handle bad tokens?

Ter

Attachment(s):
idea-token-stream.png

Permanently deleted user

Created September 25, 2013 20:56

Jon, just noticed you are the Lua plugin genius! Thanks for answering. I'm looking at your excellent plugin right this second. :)
Ter

Permanently deleted user

Created September 25, 2013 21:43

Ok, I think I have this figured out. When ANTLR finds a token error, it notifies an error listener and then throws out that token, looking for a good one. It does not want to send a bad token to the parser. For highlighting, we need to send that token. I got rid of the exception by sending a bogus token for invalid unterminated strings. I guess that there is no error message that I can send to IntelliJ. Do I just have to label that token as a bad token so that the highlighter will underline it? I see no error message in Lexer.

Permanently deleted user

Created September 25, 2013 23:01

I have this working now and I have a much better understanding of how IntelliJ wants things to work. As someone pointed out to me, there are two lexers: one for highlighting and one for parsing. The lexer for highlighting must return all tokens, good and bad. There must be a token covering every character, it seems, which makes sense. The parser, however, should not see these bad tokens because they just confuse the issue. For example, if I insert a random & in an assignment like:

a = & "x";

The highlighter should highlight the & in red but the parser should see nothing wrong. Since I am using a single lexer, generated from ANTLR, I filter the tokens coming from the lexer to strip out any bad tokens. This is similar to how the parser automatically strips comments and whitespace before sending the tokens to your parser.
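A rough sketch of the two views described in this post, using a hand-written toy lexer rather than ANTLR. Everything here (TwoViewsSketch, Tok, the token types) is hypothetical and only illustrates the idea: the highlighting view keeps every token, including BAD ones, so that every character is covered, while the parser view drops whitespace and bad tokens. Whether the IDE is happy with a filtered parser view is a separate question.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoViewsSketch {
    public enum Type { ID, EQUALS, STRING, SEMI, WS, BAD }

    public static final class Tok {
        public final Type type;
        public final int start; // inclusive character offset
        public final int end;   // exclusive character offset

        public Tok(Type type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Highlighting view: every character ends up inside exactly one token.
    public static List<Tok> lexAll(String text) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int start = i;
            char c = text.charAt(i);
            Type type;
            if (Character.isWhitespace(c)) {
                while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
                type = Type.WS;
            } else if (Character.isLetter(c)) {
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) i++;
                type = Type.ID;
            } else if (c == '=') {
                i++;
                type = Type.EQUALS;
            } else if (c == ';') {
                i++;
                type = Type.SEMI;
            } else if (c == '"') {
                int close = text.indexOf('"', i + 1);
                if (close >= 0) {
                    i = close + 1;
                    type = Type.STRING;
                } else {
                    // Unterminated string: emit one BAD token up to end of input
                    // instead of silently dropping the characters.
                    i = text.length();
                    type = Type.BAD;
                }
            } else {
                // Any other character becomes a one-character BAD token.
                i++;
                type = Type.BAD;
            }
            out.add(new Tok(type, start, i));
        }
        return out;
    }

    // Parser view: the same tokens with whitespace and bad tokens removed.
    public static List<Tok> parserView(List<Tok> all) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : all) {
            if (t.type != Type.WS && t.type != Type.BAD) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> all = lexAll("a = & \"x\";");
        System.out.println(all.size() + " tokens for highlighting, "
            + parserView(all).size() + " for the parser");
    }
}
```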

Naturally, I have a complicated problem here because I must integrate and reconcile ANTLR's mechanisms with those of IntelliJ. ANTLR tries to throw away all bad tokens so the parser doesn't see them. I had to inject a special object to trap lexer exceptions and fake out the ANTLR lexer. Serious kung fu.

Whew! I believe I have error handling working okay now and will move on to navigation of source code in my little plug-in.

Ultimately I will build a plug-in generator from an ANTLR grammar.

Ter

Jon Akhtar

Created September 26, 2013 02:17

Basically the job of the lexer is to turn the entire contents of the file into a token stream that is used as the representation of the document from then on. You have to emit a stream that covers the entire file. You can't skip anything in the lexer step.
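One way to check that rule on a dumped token list is to verify that the ranges tile the text with no gaps. This is only a sketch; CoverageCheckSketch and firstGap are hypothetical helper names, and the ranges are assumed to be [start, end) character offsets.

```java
public class CoverageCheckSketch {
    // Returns the offset at which the ranges stop tiling [0, textLength),
    // or -1 if they cover the text exactly, in order, with no gaps or overlaps.
    public static int firstGap(int[] starts, int[] ends, int textLength) {
        int expected = 0;
        for (int i = 0; i < starts.length; i++) {
            if (starts[i] != expected) {
                return expected;
            }
            expected = ends[i];
        }
        return expected == textLength ? -1 : expected;
    }

    public static void main(String[] args) {
        // Text: a = "x;  (7 characters, closing quote deleted)
        // The lexer dropped the bad string token, so coverage stops at offset 4.
        System.out.println(firstGap(new int[] {0, 1, 2, 3}, new int[] {1, 2, 3, 4}, 7));
        // The same text with one bad token covering "x; from 4 to 7.
        System.out.println(firstGap(new int[] {0, 1, 2, 3, 4}, new int[] {1, 2, 3, 4, 7}, 7));
    }
}
```

The first call reports a gap at offset 4 and the second reports none, which mirrors the difference between dropping the bad token and emitting it.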

Typically your error message is created using PsiBuilder.error, which creates an error node in the AST node tree.

On top of it you can build the AST and PSI in your parser, and do lexer-level highlighting. Since ANTLR is a combined lexer / parser you have a small problem interfacing with the APIs, because they expect a separate lexer and parser. See how they did it in GrammarKit. I remember looking at ANTLR and then deciding that it would be a difficult task to interface it to the openapi due to the need to be able to always go from psi -> ast -> token and back - and get the same result every time. There can't be any inconsistency - that consistency is what allows you to function as a semantically aware editor, and is one of the most important concepts in the openapi.

If you are outputting a token stream in your parser definition, but you aren't using it to build your PSI tree, that means you are ignoring the modifications the openapi is making to bind the whitespace tokens, comments and lazy nodes to your PSI tree.

Permanently deleted user

Created September 27, 2013 19:26

Hi Jon, actually ANTLR has a combined lexer/parser specification but splits it into two generated objects. The integration between the two is pretty smooth once I understood the protocol, as you've described. It did require some expertise inside the ANTLR runtime, however, so that would have been very annoying for someone outside of the ANTLR team.

The way I have it now, a highlighting lexer tokenizes everything and returns even bad tokens. For the parsing interface, I use a variation on the lexer that does not return bad tokens. I then have an adapter that asks the builder for tokens via advanceLexer(), which I convert back to ANTLR tokens. I hope this is not violating the consistency you talk about.

My parser calls builder.mark() at the start of every rule and calls .done(rulename) at the end of every rule function in the parser.

Thanks,
Ter

Colin Fleming

Created September 27, 2013 22:20

Hmm, you might need to be careful here, if I've understood you correctly - the parsing lexer still has to return tokens that completely cover the file (i.e. no gaps). This is one of the most significant differences from a traditional compiler parser/lexer. That might be the cause of your segment errors if you're not doing that.

It's been a while since I worked with ANTLR (at one point I was considering a project similar to what you're doing now), do your tokens extend a particular concrete class or implement a particular interface? Can you customise that? Your life will be much much easier if you can create IntelliJ style tokens (IElementTypes) directly.

Also, if you're calling mark() and done() after every rule, you'll get AST nodes for all of them, which may or may not be what you want. Again, it's been a while but I remember thinking that the ANTLR AST generation might be difficult to integrate into the IntelliJ world. Note that the AST layer is not the same as the PSI - you can build a separate PSI layer which doesn't include all the AST nodes, but IIRC in ANTLR you can create AST nodes that don't directly represent tokens in the parse process. I don't know of a way to do this in IntelliJ.

Permanently deleted user

Created September 29, 2013 22:42

Interesting. Yeah, I definitely do not pass bad tokens out of my lexer when lexing for the parser. That could explain why some of my stuff seems to be one character off, as I put in one bad character to test it. Crap. I'll have to filter those tokens before they get to ANTLR.

ANTLR v4 has a Token interface and a token factory. The difference is that I create actual token objects, not just token types, and my token types are integers, not objects.

Re: mark/done, it's the only way to automate this unless people tell me which rules they want. It's easier in my experience just to create an entire parse tree. v4 does not create ASTs like all previous versions. It only creates a parse tree and does so automatically. All of those imaginary nodes in the AST no longer make sense since it is building parse trees only.

It is not clear to me how one would create a PSI model that does not include all parse tree nodes. Doesn't createElement in the parser definition want a simple one-to-one translation from parse tree node to PSI node?

Thanks so much for helping me out here. I will try to keep putting these little nuggets into my documentation to help others following in my footsteps:

http://antlr.org/wiki/display/~admin/Intellij+plugin+development+notes

Colin Fleming

Created September 29, 2013 23:41

Ok, if you're filtering tokens that will definitely give you weirdness. Unfortunately AFAIK there's no way to avoid that.

Sorry, I don't quite understand what you mean by your token types - are you returning int or an object instance containing an integer? If it's an object could you not just return IElementTypes instead? IElementType is a little strange (and you're right that it's confusing that they refer to both lexer tokens and AST nodes) but you can consider the set of IElementTypes returned from your lexer as essentially equivalent to an enum, so if you created an array of the possible types could you index them based on the integer you would have returned, or something similar?
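A minimal sketch of that indexing idea, assuming the lexer hands back small non-negative integer token types. ElementTypeStandIn is a hypothetical stand-in for IElementType so the example runs without the IntelliJ SDK, and the token names are made up.

```java
public class TokenTypeTableSketch {
    // Hypothetical stand-in for IElementType: one shared instance per token type.
    public static final class ElementTypeStandIn {
        private final String debugName;

        public ElementTypeStandIn(String debugName) {
            this.debugName = debugName;
        }

        @Override
        public String toString() {
            return debugName;
        }
    }

    // Build the table once; index i holds the element type for integer token type i.
    public static ElementTypeStandIn[] buildTable(String[] tokenNames) {
        ElementTypeStandIn[] table = new ElementTypeStandIn[tokenNames.length];
        for (int i = 0; i < tokenNames.length; i++) {
            table[i] = new ElementTypeStandIn(tokenNames[i]);
        }
        return table;
    }

    // Always returns the same instance for the same integer, so callers
    // can compare element types by identity, much like enum constants.
    public static ElementTypeStandIn lookup(ElementTypeStandIn[] table, int tokenType) {
        if (tokenType < 0 || tokenType >= table.length) {
            throw new IllegalArgumentException("Unknown token type: " + tokenType);
        }
        return table[tokenType];
    }

    public static void main(String[] args) {
        // Assumed numbering for illustration: 0 is reserved, user tokens start at 1.
        ElementTypeStandIn[] table =
            buildTable(new String[] {"<INVALID>", "ID", "EQUALS", "STRING", "SEMI"});
        System.out.println(lookup(table, 3));                      // prints STRING
        System.out.println(lookup(table, 3) == lookup(table, 3));  // prints true
    }
}
```

The point of building the array once is that the same object comes back for the same integer every time, which is what lets the rest of the plugin treat the set of types like an enum.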

Re: mark/done - right, I haven't looked at GrammarKit for a while (although the recent threads have inspired me to go back and look at it again) but IIRC it has a mechanism to state which rules should result in PSI nodes, and possibly AST nodes too, I'm not sure. I don't think you'll have a problem creating a lot of AST nodes, except that it might complicate some of the plugin logic later if you have to traverse a lot of internal nodes to get to the elements you need. This is another way that a plugin differs significantly from a traditional use of a parser - in something like a compiler typically you'd parse, create your tree and then do at most a couple of passes over the tree. In the plugin the tree may be traversed many many times - during resolution, inspections and so forth.

The parser definition doesn't necessarily want a one-to-one mapping - the restriction is that for a particular AST node type, you have to create the same PSI node type, but you're free to not create PSI nodes for your internal AST nodes if you don't want to.

Thanks for doing up that document, too, it looks great - hopefully we can get that integrated into the JetBrains doc for the good of everyone.

Jon Akhtar

Created September 30, 2013 01:16

By the way, if there is any way that PsiViewer could make this any easier, let Colin and me know. There are already some features to help you write unit tests using IDEA's unit testing framework.

I love making and using tools for developers - so if there is a killer feature for PsiViewer you think is missing, I am sure I'll want to have it myself.

-Jon

Permanently deleted user

Created October 02, 2013 20:01

Hi Colin! I took your advice and re-examined my token filtering. It turns out that it was going to be a pain to get ANTLR to properly ignore bad tokens but also let IDEA see the bad tokens. The easiest solution, which worked very well, was to include my bad tokens in my whitespace class of tokens. It's "wrong" but works so well. Ha!
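Roughly, the trick looks like this. The sketch below is not the IntelliJ API: Cursor is a hypothetical stand-in for the builder's token cursor, and SKIPPED plays the role of the whitespace token set declared in the parser definition. Bad tokens stay in the stream, so offsets still line up, but the parser-facing cursor steps over them the same way it steps over whitespace.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class BadAsWhitespaceSketch {
    public enum Type { ID, EQUALS, STRING, SEMI, WS, BAD }

    // BAD is lumped in with whitespace: present in the stream, invisible to the parser.
    public static final Set<Type> SKIPPED = EnumSet.of(Type.WS, Type.BAD);

    public static final class Cursor {
        private final List<Type> tokens;
        private int pos;

        public Cursor(List<Type> tokens) {
            this.tokens = tokens;
            skip();
        }

        // Token the parser currently sees, or null at end of input.
        public Type current() {
            return pos < tokens.size() ? tokens.get(pos) : null;
        }

        // Position in the full, unfiltered stream.
        public int rawIndex() {
            return pos;
        }

        public void advance() {
            if (pos < tokens.size()) pos++;
            skip();
        }

        private void skip() {
            while (pos < tokens.size() && SKIPPED.contains(tokens.get(pos))) pos++;
        }
    }

    public static void main(String[] args) {
        // Token types for: a = & "x";
        List<Type> stream = Arrays.asList(Type.ID, Type.WS, Type.EQUALS, Type.WS,
            Type.BAD, Type.WS, Type.STRING, Type.SEMI);
        Cursor cursor = new Cursor(stream);
        while (cursor.current() != null) {
            System.out.println(cursor.rawIndex() + ": " + cursor.current());
            cursor.advance();
        }
    }
}
```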

I have now separated out all of the generic ANTLR adapter code into four small classes and one class that needs to be generated from a grammar. From here, I need to build the plug-in generator and also the ANTLR v4 plug-in itself. I just got sidetracked by an academic paper deadline I need to finish by November 15 but hopefully I will sneak in some fun programming time between now and then.

I'll have to look at making a PSI tree that does not include all syntax tree nodes. Can I simply have createElement in the parser definition return null for nodes I don't want in the PSI?

I have updated my document with more information and will hopefully get a chance to add more as I learn the API.

Thanks for all your help! Hopefully my plug-in generator for ANTLR-based grammars will be useful to you someday.

Permanently deleted user

Created October 02, 2013 20:06

Hi Jon! PsiViewer is indeed supercool. I don't have any new real recommendations at this point, but you might consider doing a real tree view of the PSI. I collaborated with this supersmart German guy to create an optimal tree layout that is very dense. Please see the attached syntax tree for some R programming language input. It's all free software in ANTLR v4. Here is the tree viewer:

https://github.com/antlr/antlr4/blob/master/runtime/Java/src/org/antlr/v4/runtime/tree/gui/TreeViewer.java

ANTLR v4 runtime library includes the compiled jar from Udo Borkowski:

https://code.google.com/p/treelayout/

Ter

Attachment(s):
R-parse-tree.pdf
