String literal handling with JFlex


I am new to plugin development in IntelliJ. I have finished the simple properties language example and read the documentation for custom languages. Now I am trying a few more complex examples. In the example for the properties language the Lexer always returns instances of IElementType:

{ yybegin(YYINITIAL); return FlexTypes.CRLF; }

However, when I try to introduce string literals as described in the JFlex manual or in [1], the examples always use a StringBuffer to build up the content of the string literal. When lexing of the literal is finished, a Symbol containing the contents of the buffer is returned. How does this concept map to IntelliJ's IElementType class? It seems I must always return an IElementType, so how do I store the content of the string literal inside it?

I also have a second question: is there any JavaDoc available for the IntelliJ Plugin API (specifically the Custom Language API part), or is the information at [2] all there is?



You don't store any data in the token returned from the lexer. Every lexer token has an associated start and end offset, and those can be used to retrieve the corresponding text from the file. Once you create PSI elements for your language, you can access the text using PsiElement.getText().
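As a rough illustration of that idea, here is a standalone sketch. It is not the IntelliJ API: `OffsetTokenDemo`, `Token`, and the token types are made-up names, and the tiny hand-written lexer only stands in for a generated one. The point is that a token is just a type plus a start and end offset, and the text of a string literal is read back from the buffer when it is needed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a lexer; none of these names come from the IntelliJ API.
final class OffsetTokenDemo {
    enum Type { KEY, SEPARATOR, STRING, WHITE_SPACE, BAD_CHARACTER }

    // A token carries no text, only its type and its range in the buffer.
    static final class Token {
        final Type type;
        final int start;
        final int end;

        Token(Type type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        // The text is recovered from the buffer on demand.
        String text(CharSequence buffer) {
            return buffer.subSequence(start, end).toString();
        }
    }

    static List<Token> lex(CharSequence buf) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < buf.length()) {
            int start = i;
            char c = buf.charAt(i);
            Type type;
            if (c == '"') {
                i++;
                while (i < buf.length() && buf.charAt(i) != '"') i++;
                if (i < buf.length()) i++; // consume the closing quote if there is one
                type = Type.STRING;
            } else if (Character.isWhitespace(c)) {
                while (i < buf.length() && Character.isWhitespace(buf.charAt(i))) i++;
                type = Type.WHITE_SPACE;
            } else if (c == '=') {
                i++;
                type = Type.SEPARATOR;
            } else if (Character.isLetterOrDigit(c)) {
                while (i < buf.length() && Character.isLetterOrDigit(buf.charAt(i))) i++;
                type = Type.KEY;
            } else {
                i++;
                type = Type.BAD_CHARACTER;
            }
            out.add(new Token(type, start, i));
        }
        return out;
    }

    public static void main(String[] args) {
        String source = "greeting = \"hello world\"";
        for (Token t : lex(source)) {
            System.out.println(t.type + " [" + t.start + "," + t.end + ") " + t.text(source));
        }
    }
}
```

No StringBuffer is needed here: the string literal's content is simply the slice of the source between the token's two offsets.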

Many of the classes in the OpenAPI do have javadocs, but they aren't published separately. You can access them by checking out the IntelliJ IDEA Community Edition source code and attaching it to your project.


Hi Dmitry,

Thanks for your answer. I will try to explain in more detail what I am trying to do. I tried to extend the simple properties language shown in the documentation [1] with a grammar that allows injecting Java code by putting it between opening and closing braces. In the documentation I found the following sentence: "The lexer of the enclosing language needs to return the entire fragment of the embedded language as a single chameleon token, of the type defined by the embedded language." In the example below I have commented out the return statements. If I uncomment them, the PSI viewer shows that the Java code actually consists of many elements in the tree rather than a single token. That is why I tried to modify the lexer to lex the Java code into a single token, but that does not quite work: the Java code ends up attached to the opening brace token of type BLOCK_LEFT_BRACE.

    {CRLF}                      { yybegin(YYINITIAL); return FlexTypes.CRLF; }
    {WHITE_SPACE}+              { yybegin(WAITING_VALUE); return TokenType.WHITE_SPACE; }
    "{"                         { braceCounter=0; yybegin(JAVA_CODE); return FlexTypes.BLOCK_LEFT_BRACE; }

    {WHITE_SPACE}+              { yybegin(JAVA_CODE); /*return FlexTypes.JAVA_CODE; */}
    "{"                         { braceCounter++; yybegin(JAVA_CODE); /*return FlexTypes.JAVA_CODE;*/ }
    "}"                         { braceCounter--; if(braceCounter < 0) { yybegin(BLOCK_END); yypushback(1); return FlexTypes.JAVA_CODE; } else {yybegin(JAVA_CODE); /*return FlexTypes.JAVA_CODE;*/ } }
    .                           { yybegin(JAVA_CODE); /*return FlexTypes.JAVA_CODE;*/ }

    "}"                         { yybegin(WAITING_VALUE); return FlexTypes.BLOCK_RIGHT_BRACE; }
    .                           { return TokenType.BAD_CHARACTER; }


On the other hand if I uncomment the return statements, then the tree looks like this:


So I am wondering how I can build the lexer in such a way that the Java code is lexed as a single token.





Basically, the answer is simple, but I haven't dared to use it myself because I'm not sure whether I can make it bullet-proof. The lexer needs to work consistently even if the user types completely wrong code.

If I understood your intention correctly, I think you had the right idea. In Mathematica the comment delimiters are (* and *), and when I use the following lexer spec for being *in a comment*, I get what you want:

"(*"         { yypushstate(IN_COMMENT);}
[^\*\)\(]*   { }
"*)"         { yypopstate(); if(yystate() == YYINITIAL) return MathematicaElementTypes.COMMENT; }
[\*\)\(]     { }
.            { return MathematicaElementTypes.BAD_CHARACTER; }

Ignore all the pushstate stuff; it is only there because this language allows nested comments and I need to handle them, since users really do use them. What matters is that as long as I am reading a comment, I never return a token. The lexer keeps consuming characters as long as my patterns match. When everything is finished, meaning all nested comments are closed and the last closing *) has been reached, I return one COMMENT token. This makes IDEA recognize the comment content as one token. I could refine it further, because right now the opening (* is still matched as a separate token, but I created this example only for you.

This approach has a very serious drawback! Assume the user starts a comment and doesn't end it. Then the lexer keeps consuming characters but *never* returns a token. For the rest of the file, IDEA has no idea what kind of text it is looking at; highlighting goes wrong, and so on. If, on the other hand, you return a token on every match, none of this happens, because every part of the text is tokenized correctly. Since I thought there must be a better JFlex solution to this, I asked this question a long time ago on Stack Overflow, with no results.

Finally, I do have a tip for you. Look at my code above. In your case, you can match everything until you hit a { or a }. That way you will get big chunks of Java code.
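To make the tip concrete, here is a minimal standalone sketch in plain Java rather than a JFlex spec. Everything in it (`ChunkedBlockLexerDemo`, `Token`, `mergeAdjacent`, the `TEXT` type) is a hypothetical name for illustration, and the hand-written loop only imitates what the generated lexer states would do. It returns a token on every match, so an unclosed block still tokenizes the whole input, and a separate pass merges neighbouring JAVA_CODE chunks into one token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; these names are not part of JFlex or the IntelliJ API.
final class ChunkedBlockLexerDemo {
    enum Type { TEXT, BLOCK_LEFT_BRACE, JAVA_CODE, BLOCK_RIGHT_BRACE }

    static final class Token {
        final Type type;
        final int start;
        final int end;

        Token(Type type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Emits a token for every match, so no part of the input is left untokenized.
    static List<Token> lex(CharSequence buf) {
        List<Token> out = new ArrayList<>();
        boolean inBlock = false;
        int depth = 0; // nesting depth inside the block, like braceCounter
        int i = 0;
        while (i < buf.length()) {
            int start = i;
            char c = buf.charAt(i);
            Type type;
            if (!inBlock) {
                if (c == '{') {
                    i++;
                    inBlock = true;
                    depth = 0;
                    type = Type.BLOCK_LEFT_BRACE;
                } else {
                    while (i < buf.length() && buf.charAt(i) != '{') i++;
                    type = Type.TEXT;
                }
            } else if (c == '{') {
                i++;
                depth++;
                type = Type.JAVA_CODE;
            } else if (c == '}') {
                i++;
                if (depth == 0) {
                    inBlock = false;
                    type = Type.BLOCK_RIGHT_BRACE;
                } else {
                    depth--;
                    type = Type.JAVA_CODE;
                }
            } else {
                // The "big chunk": everything up to the next brace.
                while (i < buf.length() && buf.charAt(i) != '{' && buf.charAt(i) != '}') i++;
                type = Type.JAVA_CODE;
            }
            out.add(new Token(type, start, i));
        }
        return out;
    }

    // Joins neighbouring tokens of the given type into a single token.
    static List<Token> mergeAdjacent(List<Token> tokens, Type mergeType) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            Token last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && last.type == mergeType && t.type == mergeType) {
                out.set(out.size() - 1, new Token(mergeType, last.start, t.end));
            } else {
                out.add(t);
            }
        }
        return out;
    }
}
```

With this split, the lexing step stays robust against unfinished input and the merging step produces the single token. As far as I know, IntelliJ's `MergingLexerAdapter` is intended for this kind of merging on a real lexer, but please check its source before relying on that.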

@Dmitry, have you ever come across something similar?


